Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

ADAL-NN: Anomaly detection and localization using deep relational learning in distributed systems

K Ahmed, A Altaf, NSM Jamail, F Iqbal, R Latif - Applied Sciences, 2023 - mdpi.com
Modern distributed systems that operate concurrently generate interleaved logs. Identifiers
(ID) are always associated with active instances or entities in order to track them in logs …

Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems

S Zhang, Y Zhao, X Xiong, Y Sun, X Nie… - … Proceedings of the …, 2024 - dl.acm.org
Timely localization of the root causes of gray failure is essential for maintaining the stability
of the server OS. The previous intrusive gray failure localization methods usually require …

Sparse and semi-attention guided faults diagnosis approach for distributed online services

L Zhang, Y Shi - Applied Soft Computing, 2023 - Elsevier
Despite the rapid advance of unsupervised reconstruction models in online service fault
diagnosis, existing methods still lead to frequent false positive or false negative alarms …

MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications

Y Tsubouchi, H Tsuruta - IEEE Access, 2024 - ieeexplore.ieee.org
Automated fault localization in large-scale cloud-based applications is challenging because
it involves mining multivariate time series data from large volumes of operational monitoring …

Root Cause Analysis for Distributed Systems

A Fang - cs.nthu.edu.tw
Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud
services. While numerous studies have been proposed, existing surveys typically focus on …