Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

Baro: Robust root cause analysis for microservices via multivariate Bayesian online change point detection

L Pham, H Ha, H Zhang - Proceedings of the ACM on Software …, 2024 - dl.acm.org
Detecting failures and identifying their root causes promptly and accurately is crucial for
ensuring the availability of microservice systems. A typical failure troubleshooting pipeline …

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

G Yu, P Chen, Z He, Q Yan, Y Luo, F Li… - Proceedings of the ACM …, 2024 - dl.acm.org
In large-scale online service systems, the occurrence of software changes is inevitable and
frequent. Despite rigorous pre-deployment testing practices, the presence of defective …

ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …

Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?

L Pham, H Ha, H Zhang - Proceedings of the 39th IEEE/ACM …, 2024 - dl.acm.org
Microservice architecture has become a popular architecture adopted by many cloud
applications. However, identifying the root cause of a failure in microservice systems is still a …

TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data

S Xie, J Wang, H He, Z Wang, Y Zhao, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Microservice-based systems often suffer from reliability issues due to their intricate
interactions and expanding scale. With the rapid growth of observability techniques, various …

MicroIRC: Instance-level Root Cause Localization for Microservice Systems

Y Zhu, J Wang, B Li, Y Zhao, Z Zhang, Y Xiong… - Journal of Systems and …, 2024 - Elsevier
The use of microservice architecture is gaining popularity in the development of web
applications. However, identifying the root cause of a failure can be challenging due to the …

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

H Chen, P Chen, G Yu, X Li, Z He… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …

Root Cause Analysis for Microservices based on Causal Inference: How Far Are We?

L Pham, H Ha, H Zhang - 2024 39th IEEE/ACM International …, 2024 - ieeexplore.ieee.org
Microservice architecture has become a popular architecture adopted by many cloud
applications. However, identifying the root cause of a failure in microservice systems is still a …

Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

Y Zhu, J Wang, B Li, X Tang, H Li, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of cloud-native technologies, microservice-based software systems
face challenges in accurately localizing root causes when failures occur. Additionally, the …