Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

MicroRCA: Root cause localization of performance issues in microservices

L Wu, J Tordsson, E Elmroth… - NOMS 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Software architecture is undergoing a transition from monolithic architectures to
microservices to achieve resilience, agility and scalability in software development …

RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Z Wang, Z Liu, Y Zhang, A Zhong, J Wang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Large language model (LLM) applications in cloud root cause analysis (RCA) have been
actively explored recently. However, current methods are still reliant on manual workflow …

Exploring the potential of distributed computing continuum systems

PK Donta, I Murturi, V Casamayor Pujol, B Sedlak… - Computers, 2023 - mdpi.com
Computing paradigms have evolved significantly in recent decades, moving from large
room-sized resources (processors and memory) to incredibly small computing nodes. Recently …

A causality mining and knowledge graph based method of root cause diagnosis for performance anomaly in cloud applications

J Qiu, Q Du, K Yin, SL Zhang, C Qian - Applied Sciences, 2020 - mdpi.com
With the development of cloud computing technology, the microservice architecture (MSA)
has become a prevailing application architecture in cloud-native applications. Many user …

Autonomous selection of the fault classification models for diagnosing microservice applications

Y Song, R Xin, P Chen, R Zhang, J Chen… - Future Generation …, 2024 - Elsevier
Microservices architecture is a new approach for deploying applications and services in the
cloud, gaining popularity for constructing large-scale systems that are highly resilient, robust …

CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications

R Xin, P Chen, Z Zhao - Journal of Systems and Software, 2023 - Elsevier
Effectively localizing root causes of performance anomalies is crucial to enabling the rapid
recovery and loss mitigation of microservice applications in the cloud. Depending on the …

An approach for anomaly diagnosis based on hybrid graph model with logs for distributed services

T Jia, P Chen, L Yang, Y Li, F Meng… - 2017 IEEE international …, 2017 - ieeexplore.ieee.org
Detecting runtime anomalies is very important to monitoring and maintenance of distributed
services. People often use execution logs for troubleshooting and problem diagnosis …

tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces

L Huang, T Zhu - Proceedings of the ACM Symposium on Cloud …, 2021 - dl.acm.org
The traditional approach for performance debugging relies upon performance profilers (e.g.,
gprof, VTune) that provide average function runtime information. These aggregate statistics …