Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong, Z Zhong… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern microservice systems have gained widespread adoption due to their high
scalability, flexibility, and extensibility. However, the characteristics of independent …

HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources

Z Zhu, C Lee, X Tang, P He - ACM Transactions on Software …, 2024 - dl.acm.org
Microservices architecture improves software scalability, resilience, and agility but also
poses significant challenges to system reliability due to its complexity and dynamic nature …

ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems

Y Sun, B Shi, M Mao, M Ma, S Xia, S Zhang… - Proceedings of the 39th …, 2024 - nkcs.iops.ai
Automated incident management is critical for large-scale microservice systems, including
tasks such as anomaly detection (AD), failure triage (FT), and root cause localization (RCL) …

Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space

Z Xie, S Zhang, Y Geng, Y Zhang, M Ma, X Nie… - Proceedings of the 30th …, 2024 - dl.acm.org
Many failure root cause analysis (RCA) algorithms for microservices have been proposed
with the widespread adoption of microservices systems. Existing algorithms generally focus …

HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Y Pan, Y Zhang, T Bi, L Han, Y Zhang, M Ma… - Proceedings of the …, 2023 - dl.acm.org
This study demonstrates the salient facts and challenges of host failure operations in
hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics …

Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

Y Zhu, J Wang, B Li, X Tang, H Li, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of cloud-native technologies, microservice-based software systems
face challenges in accurately localizing root causes when failures occur. Additionally, the …

TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data

S Xie, J Wang, H He, Z Wang, Y Zhao, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Microservice-based systems often suffer from reliability issues due to their intricate
interactions and expanding scale. With the rapid growth of observability techniques, various …

DGERCL: A Dynamic Graph Embedding Approach for Root Cause Localization in Microservice Systems

H Cheng, Q Li, B Liu, S Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Root cause localization in microservice systems refers to finding the root cause of
system anomalies using system information. Many methods construct a graph structure and …

Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization

L Tao, S Zhang, Z Jia, J Sun, M Ma, Z Li… - Proceedings of the 39th …, 2024 - nkcs.iops.ai
Microservice systems are inherently complex and prone to failures, which can significantly
impact user experience. Existing diagnostic approaches based on single-modal data such …

A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management

Y Sun, J Wang, Z Li, X Nie, M Ma, S Zhang, Y Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
AIOps algorithms play a crucial role in the maintenance of microservice systems. The
performance leaderboards of many previous benchmarks provide valuable guidance for selecting …