DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning

C Zhang, X Peng, C Sha, K Zhang, Z Fu, X Wu… - Proceedings of the 44th …, 2022 - dl.acm.org
A microservice system in industry is usually a large-scale distributed system consisting of
dozens to thousands of services running in different machines. An anomaly of the system …

Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback

L Wang, C Zhang, R Ding, Y Xu, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
In microservice systems, the identification of root causes of anomalies is imperative for
service reliability and business impact. This process is typically divided into two phases: (i) …

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong, Z Zhong… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern microservice systems have gained widespread adoption due to their high
scalability, flexibility, and extensibility. However, the characteristics of independent …

TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

R Ding, C Zhang, L Wang, Y Xu, M Ma, X Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of
microservice systems. However, performing RCA on modern microservice systems can be …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

Look Deep into the Microservice System Anomaly through Very Sparse Logs

X Jiang, Y Pan, M Ma, P Wang - … of the ACM Web Conference 2023, 2023 - dl.acm.org
Intensive monitoring and anomaly diagnosis have become a knotty problem for modern
microservice architecture due to the dynamics of service dependency. While most previous …

BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection

L Pham, H Ha, H Zhang - Proceedings of the ACM on Software …, 2024 - dl.acm.org
Detecting failures and identifying their root causes promptly and accurately is crucial for
ensuring the availability of microservice systems. A typical failure troubleshooting pipeline …

DyCause: Crowdsourcing to Diagnose Microservice Kernel Failure

Y Pan, M Ma, X Jiang, P Wang - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Today many web applications in the cloud (apps) are built based on microservices.
However, as the anomaly propagates in a highly dynamic and complex way, troubleshooting …

HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Y Pan, Y Zhang, T Bi, L Han, Y Zhang, M Ma… - Proceedings of the …, 2023 - dl.acm.org
This study demonstrates the salient facts and challenges of host failure operations in
hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics …