InstantOps: A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications

R Rouf, M Rasolroveicy, M Litoiu, S Nagar… - Proceedings of the 15th …, 2024 - dl.acm.org
As microservice and cloud computing operations increasingly adopt automation, the
importance of models for fostering resilient and efficient adaptive architectures becomes …

Root Cause Analysis for Cloud-native Applications

B Żurkowski, K Zieliński - IEEE Transactions on Cloud …, 2024 - ieeexplore.ieee.org
Root cause analysis (RCA) is a critical component in maintaining the reliability and
performance of modern cloud applications. However, due to the inherent complexity of cloud …

Practical and Scalable ML-Driven Cloud Performance Debugging With Sage

Y Gan, M Liang, S Dev, D Lo, C Delimitrou - IEEE Micro, 2022 - ieeexplore.ieee.org
Cloud applications are increasingly shifting from large monolithic services to complex
graphs of loosely coupled microservices. Despite their benefits, microservices are prone to …

Self-adaptive root cause diagnosis for large-scale microservice architecture

M Ma, W Lin, D Pan, P Wang - IEEE Transactions on Services …, 2020 - ieeexplore.ieee.org
The emergence of microservice architecture in cloud systems poses new challenges for
reliable operation and maintenance. Due to numerous services and diverse types of …

Sage: practical and scalable ML-driven performance debugging in microservices

Y Gan, M Liang, S Dev, D Lo, C Delimitrou - Proceedings of the 26th …, 2021 - dl.acm.org
Cloud applications are increasingly shifting from large monolithic services to complex
graphs of loosely-coupled microservices. Despite the advantages of modularity and …

Causal modeling based fault localization in cloud systems using golden signals

P Aggarwal, S Nagar, A Gupta… - 2021 IEEE 14th …, 2021 - ieeexplore.ieee.org
In cloud-native applications, a large fraction of operational failures, known as outages, result
in violations of Service Level Objectives (SLOs). SLOs are defined around specific …

Fault injection to generate failure data for failure prediction: A case study

JR Campos, E Costa - 2020 IEEE 31st International …, 2020 - ieeexplore.ieee.org
Due to the complexity of modern software, identifying every fault before deployment is
extremely difficult or even impossible. Such residual faults can ultimately lead to failures …

MTG_CD: Multi-scale learnable transformation graph for fault classification and diagnosis in microservices

J Chen, R Zhang, P Chen, J Ren, Z Wu, Y Wang… - Journal of Cloud …, 2024 - Springer
The rapid advancement of microservice architecture in the cloud has led to the necessity of
effectively detecting, classifying, and diagnosing run failures in microservice applications …

Enhancing fault localization in microservices systems through span-level using graph convolutional networks

H Kong, T Li, J Ge, L Zhang, L Li - Automated Software Engineering, 2024 - Springer
In the domain of cloud computing and distributed systems, microservices architecture has
become preeminent due to its scalability and flexibility. However, the distributed nature of …

Sage: Using unsupervised learning for scalable performance debugging in microservices

Y Gan, M Liang, S Dev, D Lo, C Delimitrou - arXiv preprint arXiv …, 2021 - arxiv.org
Cloud applications are increasingly shifting from large monolithic services to complex
graphs of loosely-coupled microservices. Despite the advantages of modularity and …