Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Understanding and discovering software configuration dependencies in cloud and datacenter systems

Q Chen, T Wang, O Legunsen, S Li, T Xu - … of the 28th ACM Joint Meeting …, 2020 - dl.acm.org
A large percentage of real-world software configuration issues, such as misconfigurations,
involve multiple interdependent configuration parameters. However, existing techniques and …

Configuration validation with large language models

X Lian, Y Chen, R Cheng, J Huang, P Thakkar… - arXiv preprint arXiv …, 2023 - arxiv.org
Misconfigurations are the major causes of software failures. Existing configuration validation
techniques rely on manually written rules or test cases, which are expensive to implement …

Acto: Automatic end-to-end testing for operation correctness of cloud system management

JT Gu, X Sun, W Zhang, Y Jiang, C Wang… - Proceedings of the 29th …, 2023 - dl.acm.org
Cloud systems are increasingly being managed by operation programs termed operators,
which automate tedious, human-based operations. Operators of modern management …

Static detection of silent misconfigurations with deep interaction analysis

J Zhang, R Piskac, E Zhai, T Xu - Proceedings of the ACM on …, 2021 - dl.acm.org
The behavior of large systems is guided by their configurations: users set parameters in the
configuration file to dictate which corresponding part of the system code is executed …

An evolutionary study of configuration design and implementation in cloud systems

Y Zhang, H He, O Legunsen, S Li… - 2021 IEEE/ACM 43rd …, 2021 - ieeexplore.ieee.org
Many techniques were proposed for detecting software misconfigurations in cloud systems
and for diagnosing unintended behavior caused by such misconfigurations. Detection and …

Auric: using data-driven recommendation to automatically generate cellular configuration

A Mahimkar, A Sivakumar, Z Ge, S Pathak… - Proceedings of the 2021 …, 2021 - dl.acm.org
Cellular service providers add carriers in the network in order to support the increasing
demand in voice and data traffic and provide good quality of service to the users. Addition of …

Test-case prioritization for configuration testing

R Cheng, L Zhang, D Marinov, T Xu - Proceedings of the 30th ACM …, 2021 - dl.acm.org
Configuration changes are among the dominant causes of failures of large-scale software
system deployment. Given the velocity of configuration changes, typically at the scale of …

Fail through the cracks: Cross-system interaction failures in modern cloud systems

L Tang, C Bhandari, Y Zhang, A Karanika, S Ji… - Proceedings of the …, 2023 - dl.acm.org
Modern cloud systems are orchestrations of independent and interacting (sub-)systems,
each specializing in important services (e.g., data processing, storage, resource …