Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2024 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

Microservice root cause analysis with limited observability through intervention recognition in the latent space

Z Xie, S Zhang, Y Geng, Y Zhang, M Ma, X Nie… - Proceedings of the 30th …, 2024 - dl.acm.org
Many failure root cause analysis (RCA) algorithms for microservices have been proposed
with the widespread adoption of microservices systems. Existing algorithms generally focus …

Chain-of-Event: Interpretable root cause analysis for microservices through automatically learning weighted event causal graph

Z Yao, C Pei, W Chen, H Wang, L Su, H Jiang… - … Proceedings of the …, 2024 - dl.acm.org
This paper presents Chain-of-Event (CoE), an interpretable model for root cause analysis in
microservice systems that analyzes causal relationships of events transformed from multi …

ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems

Y Sun, B Shi, M Mao, M Ma, S Xia, S Zhang… - Proceedings of the 39th …, 2024 - dl.acm.org
Automated incident management is critical for large-scale microservice systems, including
tasks such as anomaly detection (AD), failure triage (FT), and root cause localization (RCL) …

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

G Yu, P Chen, Z He, Q Yan, Y Luo, F Li… - Proceedings of the ACM …, 2024 - dl.acm.org
In large-scale online service systems, the occurrence of software changes is inevitable and
frequent. Despite rigorous pre-deployment testing practices, the presence of defective …

Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization

L Tao, S Zhang, Z Jia, J Sun, M Ma, Z Li, Y Sun… - Proceedings of the 39th …, 2024 - dl.acm.org
Microservice systems are inherently complex and prone to failures, which can significantly
impact user experience. Existing diagnostic approaches based on single-modal data such …

TraStrainer: Adaptive sampling for distributed traces with system runtime state

H Huang, X Zhang, P Chen, Z He, Z Chen… - Proceedings of the …, 2024 - dl.acm.org
Distributed tracing has been widely adopted in many microservice systems and plays an
important role in monitoring and analyzing the system. However, trace data often come in …

LogLead - fast and integrated log loader, enhancer, and anomaly detector

MV Mäntylä, Y Wang, J Nyyssölä - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
This paper introduces LogLead, a tool designed for efficient log analysis benchmarking.
LogLead combines three essential steps in log processing: loading, enhancing, and …