AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

AIOps solutions for incident management: Technical guidelines and a comprehensive literature review

Y Remil, A Bendimerad, R Mathonat… - arXiv preprint arXiv …, 2024 - arxiv.org
The management of modern IT systems poses unique challenges, necessitating scalability,
reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on …

Incremental causal graph learning for online root cause analysis

D Wang, Z Chen, Y Fu, Y Liu, H Chen - Proceedings of the 29th ACM …, 2023 - dl.acm.org
The task of root cause analysis (RCA) is to identify the root causes of system faults/failures
by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure …

ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …

CausIL: Causal graph for instance level microservice data

S Chakraborty, S Garg, S Agarwal, A Chauhan… - Proceedings of the …, 2023 - dl.acm.org
AI-based monitoring has become crucial for cloud-based services due to its scale. A
common approach to AI-based monitoring is to detect causal relationships among service …

Case studies of causal discovery from IT monitoring time series

A Aït-Bachir, CK Assaad, C de Bignicourt… - arXiv preprint arXiv …, 2023 - arxiv.org
Information technology (IT) systems are vital for modern businesses, handling data storage,
communication, and process automation. Monitoring these systems is crucial for their proper …

Incremental Causal Graph Learning for Online Unsupervised Root Cause Analysis

D Wang, Z Chen, Y Fu, Y Liu, H Chen - arXiv preprint arXiv:2305.10638, 2023 - arxiv.org
The task of root cause analysis (RCA) is to identify the root causes of system faults/failures
by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure …

A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems

L Zheng, Z Chen, J He, H Chen - Proceedings of the ACM on Web …, 2024 - dl.acm.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …