Groot: An event-graph-based approach for root cause analysis in industrial settings

H Wang, Z Wu, H Jiang, Y Huang… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
For large-scale distributed systems, it is crucial to efficiently diagnose the root causes of
incidents to maintain high system availability. The recent development of microservice …

MicroHECL: High-efficient root cause localization in large-scale microservice systems

D Liu, C He, X Peng, F Lin, C Zhang… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Availability issues of industrial microservice systems (e.g., drop of successfully placed orders
and processed transactions) directly affect the running of the business. These issues are …

Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback

L Wang, C Zhang, R Ding, Y Xu, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
In microservice systems, the identification of root causes of anomalies is imperative for
service reliability and business impact. This process is typically divided into two phases: (i) …

Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Identifying root-cause metrics for incident diagnosis in online service systems

C Wu, N Zhao, L Wang, X Yang, S Li… - 2021 IEEE 32nd …, 2021 - ieeexplore.ieee.org
Incidents in online service systems could incur poor user experience and tremendous
economic loss. To reduce the influence of incidents and guarantee service reliability, it is …

Localizing failure root causes in a microservice through causality inference

Y Meng, S Zhang, Y Sun, R Zhang, Z Hu… - 2020 IEEE/ACM 28th …, 2020 - ieeexplore.ieee.org
An increasing number of Internet applications are applying microservice architecture due to
its flexibility and clear logic. The stability of microservice is thus vitally important for these …

MS-Rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications

M Ma, W Lin, D Pan, P Wang - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
This paper presents a self-adaptive root cause diagnosis framework, named MS-Rank, to
analyze multiple metrics collected from micro-service architecture. MS-Rank decomposes …

Practical root cause localization for microservice systems via trace analysis

Z Li, J Chen, R Jiao, N Zhao, Z Wang… - 2021 IEEE/ACM 29th …, 2021 - ieeexplore.ieee.org
Microservice architecture is applied by an increasing number of systems because of its
benefits on delivery, scalability, and autonomy. It is essential but challenging to localize root …

Graph-based incident aggregation for large-scale online service systems

Z Chen, J Liu, Y Su, H Zhang, X Wen… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
As online service systems continue to grow in terms of complexity and volume, how service
incidents are managed will significantly impact company revenue and user trust. Due to the …

Causal inference techniques for microservice performance diagnosis: Evaluation and guiding recommendations

L Wu, J Tordsson, E Elmroth… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Causal inference (CI) is one of the popular performance diagnosis methods, which infers the
anomaly propagation from the observed data for locating the root causes. Although some …