MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems

L Zheng, Z Chen, J He, H Chen - Proceedings of the ACM on Web …, 2024 - dl.acm.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …

Multi-modal Causal Structure Learning and Root Cause Analysis

L Zheng, Z Chen, J He, H Chen - arXiv preprint arXiv:2402.02357, 2024 - arxiv.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …

Root Cause Analysis in Microservice Using Neural Granger Causal Discovery

CM Lin, C Chang, WY Wang, KD Wang… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In recent years, microservices have gained widespread adoption in IT operations due to
their scalability, maintenance, and flexibility. However, it becomes challenging for site …

Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph

Z Yao, C Pei, W Chen, H Wang, L Su, H Jiang… - … Proceedings of the …, 2024 - dl.acm.org
This paper presents Chain-of-Event (CoE), an interpretable model for root cause analysis in
microservice systems that analyzes causal relationships of events transformed from multi …

TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

R Ding, C Zhang, L Wang, Y Xu, M Ma, X Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of
microservice systems. However, performing RCA on modern microservice systems can be …

Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback

L Wang, C Zhang, R Ding, Y Xu, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
In microservice systems, the identification of root causes of anomalies is imperative for
service reliability and business impact. This process is typically divided into two phases: (i) …

Causal inference techniques for microservice performance diagnosis: Evaluation and guiding recommendations

L Wu, J Tordsson, E Elmroth… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Causal inference (CI) is one of the popular performance diagnosis methods, which infers the
anomaly propagation from the observed data for locating the root causes. Although some …

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

Groot: An event-graph-based approach for root cause analysis in industrial settings

H Wang, Z Wu, H Jiang, Y Huang… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
For large-scale distributed systems, it is crucial to efficiently diagnose the root causes of
incidents to maintain high system availability. The recent development of microservice …