Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications, makes it harder to detect failures and to identify their possible root causes …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Semi-supervised log-based anomaly detection via probabilistic label estimation

L Yang, J Chen, Z Wang, W Wang… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
With the growth of software systems, logs have become an important data to aid system
maintenance. Log-based anomaly detection is one of the most important methods for such …

LILAC: Log parsing using LLMs with adaptive parsing cache

Z Jiang, J Liu, Z Chen, Y Li, J Huang, Y Huo… - Proceedings of the …, 2024 - dl.acm.org
Log parsing transforms log messages into structured formats, serving as the prerequisite
step for various log analysis tasks. Although a variety of log parsing approaches have been …

RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Z Wang, Z Liu, Y Zhang, A Zhong, J Wang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Large language model (LLM) applications in cloud root cause analysis (RCA) have been
actively explored recently. However, current methods are still reliant on manual workflow …

How incidental are the incidents? Characterizing and prioritizing incidents for large-scale online service systems

J Chen, S Zhang, X He, Q Lin, H Zhang, D Hao… - Proceedings of the 35th …, 2020 - dl.acm.org
Although tremendous efforts have been devoted to the quality assurance of online service
systems, in reality, these systems still come across many incidents (i.e., unplanned …

LLMParser: A LLM-based log parsing framework

Z Jiang, J Liu, Z Chen, Y Li, J Huang, Y Huo… - arXiv preprint arXiv …, 2023 - arxiv.org
The process of log parsing, which converts log messages into structured formats, is a crucial
step for various log analysis tasks. Although numerous log parsers have been proposed …

CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications

R Xin, P Chen, Z Zhao - Journal of Systems and Software, 2023 - Elsevier
Effectively localizing root causes of performance anomalies is crucial to enabling the rapid
recovery and loss mitigation of microservice applications in the cloud. Depending on the …

Exploring better black-box test case prioritization via log analysis

Z Chen, J Chen, W Wang, J Zhou, M Wang… - ACM Transactions on …, 2023 - dl.acm.org
Test case prioritization (TCP) has been widely studied in regression testing, which aims to
optimize the execution order of test cases so as to detect more faults earlier. TCP has been …

Real-time incident prediction for online service systems

N Zhao, J Chen, Z Wang, X Peng, G Wang… - Proceedings of the 28th …, 2020 - dl.acm.org
Incidents in online service systems could dramatically degrade system availability and
destroy user experience. To guarantee service quality and reduce economic loss, it is …