Deep neural networks in the cloud: Review, applications, challenges and research directions

KY Chan, B Abu-Salih, R Qaddoura, AZ Ala'M… - Neurocomputing, 2023 - Elsevier
Deep neural networks (DNNs) are currently being deployed as machine learning technology
in a wide range of important real-world applications. DNNs consist of a huge number of …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

Causal inference-based root cause analysis for online service systems with intervention recognition

M Li, Z Li, K Yin, X Nie, W Zhang, K Sui… - Proceedings of the 28th …, 2022 - dl.acm.org
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

R Ding, C Zhang, L Wang, Y Xu, M Ma, X Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of
microservice systems. However, performing RCA on modern microservice systems can be …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

AutoLog: A log sequence synthesis framework for anomaly detection

Y Huo, Y Li, Y Su, P He, Z Xie… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
The rapid progress of modern computing systems has led to a growing interest in informative
run-time logs. Various log-based anomaly detection techniques have been proposed to …

A survey of graph-based deep learning for anomaly detection in distributed systems

AD Pazho, GA Noghre, AA Purkayastha… - … on Knowledge and …, 2023 - ieeexplore.ieee.org
Anomaly detection is a crucial task in complex distributed systems. A thorough
understanding of the requirements and challenges of anomaly detection is pivotal to the …

EvLog: Identifying Anomalous Logs over Software Evolution

Y Huo, C Lee, Y Su, S Shan, J Liu… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Software logs record system activities, aiding maintainers in identifying the underlying
causes for failures and enabling prompt mitigation actions. However, maintainers need to …