Recommending root-cause and mitigation steps for cloud incidents using large language models

T Ahmed, S Ghosh, C Bansal… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Incident management for cloud services is a complex process involving several steps and
has a huge impact on both service health and developer productivity. On-call engineers …

ImDiffusion: Imputed diffusion models for multivariate time series anomaly detection

Y Chen, C Zhang, M Ma, Y Liu, R Ding, B Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Anomaly detection in multivariate time series data is of paramount importance for ensuring
the efficient operation of large-scale systems across diverse domains. However, accurately …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - jun-zeng.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Z Wang, Z Liu, Y Zhang, A Zhong, L Fan, L Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language model (LLM) applications in cloud root cause analysis (RCA) have been
actively explored recently. However, current methods are still reliant on manual workflow …

Detection is better than cure: A cloud incidents perspective

V Ganatra, A Parayil, S Ghosh, Y Kang, M Ma… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud providers use automated watchdogs or monitors to continuously observe service
availability and to proactively report incidents when system performance degrades. Improper …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …

Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

X Zhang, S Ghosh, C Bansal, R Wang, M Ma… - … Proceedings of the …, 2024 - dl.acm.org
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud
services, requiring on-call engineers to identify the primary issues and implement corrective …