Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker

D Roy, X Zhang, R Bhave, C Bansal… - … Proceedings of the …, 2024 - dl.acm.org

The growing complexity of cloud based software systems has resulted in incident
management becoming an integral part of the software development lifecycle. Root cause …

Cited by 24 · Related articles · All 2 versions

[PDF] Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io

Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Cited by 24 · Related articles

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org

Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Cited by 55 · Related articles · All 6 versions

Acto: Automatic end-to-end testing for operation correctness of cloud system management

JT Gu, X Sun, W Zhang, Y Jiang, C Wang… - Proceedings of the 29th …, 2023 - dl.acm.org

Cloud systems are increasingly being managed by operation programs termed operators,
which automate tedious, human-based operations. Operators of modern management …

Cited by 19 · Related articles · All 7 versions

If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

BA Stoica, U Sethi, Y Su, C Zhou, S Lu, J Mace… - Proceedings of the …, 2024 - dl.acm.org

Retry, the re-execution of a task on failure, is a common mechanism to enable resilient
software systems. Yet, despite its commonality and long history, retry remains difficult to …

Cited by 2 · Related articles · All 4 versions

When your infrastructure is a buggy program: Understanding faults in infrastructure as code ecosystems

GP Drosos, T Sotiropoulos, G Alexopoulos… - Proceedings of the …, 2024 - dl.acm.org

Modern applications have become increasingly complex and their manual installation and
configuration is no longer practical. Instead, IT organizations heavily rely on Infrastructure as …

Cited by 3 · Related articles · All 5 versions

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

J Pan, H Wu, T Leesatapornwongsa, S Nath… - Proceedings of the …, 2024 - dl.acm.org

Debugging a failure usually requires reproducing it first. This can be hard for failures in
production distributed systems, where bugs are exposed only by some unusual faulty …

Multi-Grained Specifications for Distributed System Model Checking and Verification

L Ouyang, X Sun, R Tang, Y Huang, M Jivrajani… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper presents our experience specifying and verifying the correctness of ZooKeeper, a
complex and evolving distributed coordination system. We use TLA+ to model fine-grained …

Cited by 1 · Related articles · All 3 versions

Style Transfer: From Stitching to Neural Networks

X Xu, Z Wang, Y Zhang, Y Liu, Z Wang… - … Conference on Big …, 2024 - ieeexplore.ieee.org

This article compares two style transfer methods in image processing: the traditional method,
which synthesizes new images by stitching together small patches from existing pattern …

Cited by 1 · Related articles · All 3 versions

Can My Microservice Tolerate an Unreliable Database? Resilience Testing with Fault Injection and Visualization

M Assad, CS Meiklejohn, H Miller… - … of the 2024 IEEE/ACM 46th …, 2024 - dl.acm.org

In microservice applications, ensuring resilience during database or service disruptions
constitutes a significant challenge. While several tools address resilience testing for service …

Cited by 1 · Related articles · All 4 versions
