Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org
Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

What bugs cause production cloud incidents?

H Liu, S Lu, M Musuvathi, S Nath - Proceedings of the Workshop on Hot …, 2019 - dl.acm.org
Cloud services have become the backbone of today's computing world. Runtime incidents,
which adversely affect the expected service operations, are extremely costly in terms of user …

Cloud datacenter workload estimation using error preventive time series forecasting models

J Kumar, AK Singh - Cluster Computing, 2020 - Springer
The workload estimation plays a vital role in efficient management of cloud resources. This
paper introduces the error preventive score (EPS) in time series forecasting models to …

Evaluating scalability bottlenecks by workload extrapolation

R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org
Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …

Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance

Z Cao, S Liu, F Wu, G Wang, B Li, DHC Du - 17th USENIX Conference …, 2019 - usenix.org
Data deduplication is an effective way of improving storage space utilization. The data
generated by deduplication is persistently stored in data chunks or data containers (a …

mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture

W Zhang, H Guo, J Yang, Y Zhang, C Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
The escalating complexity of micro-services architecture in cloud-native technologies poses
significant challenges for maintaining system stability and efficiency. To conduct root cause …

PBSE: A robust path-based speculative execution for degraded-network tail tolerance in data-parallel frameworks

RO Suminto, CA Stuardo, A Clark, H Ke… - Proceedings of the …, 2017 - dl.acm.org
We reveal loopholes of Speculative Execution (SE) implementations under a unique fault
model: node-level network throughput degradation. This problem appears in many data …