Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning

C Zhang, X Peng, C Sha, K Zhang, Z Fu, X Wu… - Proceedings of the 44th …, 2022 - dl.acm.org
A microservice system in industry is usually a large-scale distributed system consisting of
dozens to thousands of services running in different machines. An anomaly of the system …

Unsupervised detection of microservice trace anomalies through service-level deep Bayesian networks

P Liu, H Xu, Q Ouyang, R Jiao, Z Chen… - 2020 IEEE 31st …, 2020 - ieeexplore.ieee.org
The anomalies of microservice invocation traces (traces) often indicate that the quality of the
microservice-based large software service is being impaired. However, timely and …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong, Z Zhong… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern microservice systems have gained widespread adoption due to their high
scalability, flexibility, and extensibility. However, the characteristics of independent …

Self-supervised log parsing

S Nedelkoski, J Bogatinovski, A Acker… - Machine Learning and …, 2021 - Springer
Logs are extensively used during the development and maintenance of software systems.
They collect runtime events and allow tracking of code execution, which enables a variety of …

Autonomous selection of the fault classification models for diagnosing microservice applications

Y Song, R Xin, P Chen, R Zhang, J Chen… - Future Generation …, 2024 - Elsevier
Microservices architecture is a new approach for deploying applications and services in the
cloud, gaining popularity for constructing large-scale systems that are highly resilient, robust …

Robust multimodal failure detection for microservice systems

C Zhao, M Ma, Z Zhong, S Zhang, Z Tan… - Proceedings of the 29th …, 2023 - dl.acm.org
Proactive failure detection of instances is vitally essential to microservice systems because
an instance failure can propagate to the whole system and degrade the system's …

tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces

L Huang, T Zhu - Proceedings of the ACM Symposium on Cloud …, 2021 - dl.acm.org
The traditional approach for performance debugging relies upon performance profilers (e.g.,
gprof, VTune) that provide average function runtime information. These aggregate statistics …

TraceGra: A trace-based anomaly detection for microservice using graph deep learning

J Chen, F Liu, J Jiang, G Zhong, D Xu, Z Tan… - Computer …, 2023 - Elsevier
Trace is widely used to detect anomalies in distributed microservice systems because of the
capability of precisely reconstructing user request paths. However, most existing trace …