Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

Anomaly detection and diagnosis for cloud services: Practical experiments and lessons learned

C Sauvanaud, M Kaâniche, K Kanoun, K Lazri… - Journal of Systems and …, 2018 - Elsevier
The dependability of cloud computing services is a major concern of cloud providers. In
particular, anomaly detection techniques are crucial to detect anomalous service behaviors …

Anomaly detection and diagnosis for container-based microservices with performance monitoring

Q Du, T Xie, Y He - Algorithms and Architectures for Parallel Processing …, 2018 - Springer
With emerging container technologies, such as Docker, microservices-based applications
can be developed and deployed in cloud environments with much greater agility. The dependability of …

Robust and accurate performance anomaly detection and prediction for cloud applications: a novel ensemble learning-based framework

R Xin, H Liu, P Chen, Z Zhao - Journal of Cloud Computing, 2023 - Springer
Effectively detecting run-time performance anomalies is crucial for clouds to identify
abnormal performance behavior and forestall future incidents. To be used for real-world …

PerfAugur: Robust diagnostics for performance anomalies in cloud services

S Roy, AC König, I Dvorkin… - 2015 IEEE 31st …, 2015 - ieeexplore.ieee.org
Cloud platforms involve multiple independently developed components, often executing on
diverse hardware configurations and across multiple data centers. This complexity makes …

What bugs cause production cloud incidents?

H Liu, S Lu, M Musuvathi, S Nath - Proceedings of the Workshop on Hot …, 2019 - dl.acm.org
Cloud services have become the backbone of today's computing world. Runtime incidents,
which adversely affect the expected service operations, are extremely costly in terms of user …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Automatic anomaly detection in the cloud via statistical learning

J Hochenbaum, OS Vallis, A Kejariwal - arXiv preprint arXiv:1704.07706, 2017 - arxiv.org
Performance and high availability have become increasingly important drivers, amongst
other drivers, for user retention in the context of web services such as social networks, and …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

DLA: Detecting and localizing anomalies in containerized microservice architectures using Markov models

A Samir, C Pahl - 2019 7th International Conference on Future …, 2019 - ieeexplore.ieee.org
Container-based microservice architectures are emerging as a new approach for building
distributed applications as a collection of independent services that work together. As a …