Exploring LLM-based agents for root cause analysis

D Roy, X Zhang, R Bhave, C Bansal… - … Proceedings of the …, 2024 - dl.acm.org
The growing complexity of cloud-based software systems has resulted in incident
management becoming an integral part of the software development lifecycle. Root cause …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Acto: Automatic end-to-end testing for operation correctness of cloud system management

JT Gu, X Sun, W Zhang, Y Jiang, C Wang… - Proceedings of the 29th …, 2023 - dl.acm.org
Cloud systems are increasingly being managed by operation programs termed operators,
which automate tedious, human-based operations. Operators of modern management …

If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

BA Stoica, U Sethi, Y Su, C Zhou, S Lu, J Mace… - Proceedings of the …, 2024 - dl.acm.org
Retry, the re-execution of a task on failure, is a common mechanism to enable resilient
software systems. Yet, despite its commonality and long history, retry remains difficult to …

When your infrastructure is a buggy program: Understanding faults in infrastructure as code ecosystems

GP Drosos, T Sotiropoulos, G Alexopoulos… - Proceedings of the …, 2024 - dl.acm.org
Modern applications have become increasingly complex and their manual installation and
configuration is no longer practical. Instead, IT organizations heavily rely on Infrastructure as …

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

J Pan, H Wu, T Leesatapornwongsa, S Nath… - Proceedings of the …, 2024 - dl.acm.org
Debugging a failure usually requires reproducing it first. This can be hard for failures in
production distributed systems, where bugs are exposed only by some unusual faulty …

Multi-Grained Specifications for Distributed System Model Checking and Verification

L Ouyang, X Sun, R Tang, Y Huang, M Jivrajani… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents our experience specifying and verifying the correctness of ZooKeeper, a
complex and evolving distributed coordination system. We use TLA+ to model fine-grained …

Style Transfer: From Stitching to Neural Networks

X Xu, Z Wang, Y Zhang, Y Liu, Z Wang… - … Conference on Big …, 2024 - ieeexplore.ieee.org
This article compares two style transfer methods in image processing: the traditional method,
which synthesizes new images by stitching together small patches from existing pattern …

Can My Microservice Tolerate an Unreliable Database? Resilience Testing with Fault Injection and Visualization

M Assad, CS Meiklejohn, H Miller… - … of the 2024 IEEE/ACM 46th …, 2024 - dl.acm.org
In microservice applications, ensuring resilience during database or service disruptions
constitutes a significant challenge. While several tools address resilience testing for service …