A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Research of artificial intelligence operations for wind turbines considering anomaly detection, root cause analysis, and incremental training

C Zhang, D Hu, T Yang - Reliability Engineering & System Safety, 2024 - Elsevier
Artificial intelligence operations (AIOps) is emerging as a novel technology in industrial
automation to improve operation and maintenance (O&M) efficiency through machine …

AIOps solutions for incident management: Technical guidelines and a comprehensive literature review

Y Remil, A Bendimerad, R Mathonat… - arXiv preprint arXiv …, 2024 - arxiv.org
The management of modern IT systems poses unique challenges, necessitating scalability,
reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on …

Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications: A Review

R Xin, J Wang, P Chen, Z Zhao - ACM Computing Surveys, 2025 - dl.acm.org
Performance diagnosis systems are defined as detecting abnormal performance
phenomena and play a crucial role in cloud applications. An effective performance …

LogRule: Efficient structured log mining for root cause analysis

P Notaro, S Haeri, J Cardoso… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Accurate, timely Root Cause Analysis (RCA) is essential to successful IT operations as a
primary step to incident remediation. RCA automation using data mining techniques in large …

Constructing large-scale real-world benchmark datasets for AIOps

Z Li, N Zhao, S Zhang, Y Sun, P Chen, X Wen… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia
and industry to enable automated and effective software service management. Plenty of …

Supporting deep neural network safety analysis and retraining through heatmap-based unsupervised learning

H Fahmy, F Pastore, M Bagherzadeh… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Deep neural networks (DNNs) are increasingly important in safety-critical systems, for
example, in their perception layer to analyze images. Unfortunately, there is a lack of …

MoniLog: An automated log-based anomaly detection system for cloud computing infrastructures

A Vervaet - 2021 IEEE 37th International Conference on Data …, 2021 - ieeexplore.ieee.org
Within today's large-scale systems, one anomaly can impact millions of users. Detecting
such events in real-time is essential to maintain the quality of services. It allows the …

Learning dependencies in distributed cloud applications to identify and localize anomalies

D Scheinert, A Acker, L Thamsen… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Operation and maintenance of large distributed cloud applications can quickly become
unmanageably complex, putting human operators under immense stress when problems …

Heterogeneous data-driven failure diagnosis for microservice-based industrial clouds towards consumer digital ecosystems

Y Xu, Z Qiu, H Gao, X Zhao, L Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Consumer digital ecosystems include a large volume of different types of applications, and
those applications are usually deployed in industrial cloud computing systems. Currently …