AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automated root causing of cloud incidents using in-context learning with GPT-4

X Zhang, S Ghosh, C Bansal, R Wang, M Ma… - … Proceedings of the …, 2024 - dl.acm.org
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud
services, requiring on-call engineers to identify the primary issues and implement corrective …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications: A Review

R Xin, J Wang, P Chen, Z Zhao - ACM Computing Surveys, 2025 - dl.acm.org
Performance diagnosis systems are defined as detecting abnormal performance
phenomena and play a crucial role in cloud applications. An effective performance …

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

Robust multimodal failure detection for microservice systems

C Zhao, M Ma, Z Zhong, S Zhang, Z Tan… - Proceedings of the 29th …, 2023 - dl.acm.org
Proactive failure detection of instances is vitally essential to microservice systems because
an instance failure can propagate to the whole system and degrade the system's …

TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

R Ding, C Zhang, L Wang, Y Xu, M Ma, X Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of
microservice systems. However, performing RCA on modern microservice systems can be …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …