HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Y Pan, Y Zhang, T Bi, L Han, Y Zhang, M Ma… - Proceedings of the …, 2023 - dl.acm.org
This study demonstrates the salient facts and challenges of host failure operations in
hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics …

Multi-modal Causal Structure Learning and Root Cause Analysis

L Zheng, Z Chen, J He, H Chen - arXiv preprint arXiv:2402.02357, 2024 - arxiv.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …

Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Y Gan, G Liu, X Zhang, Q Zhou, J Wu… - Proceedings of the 28th …, 2023 - dl.acm.org
Cloud microservices are being scaled up due to the rising demand for new features and the
convenience of cloud-native technologies. However, the growing scale of microservices …

Multilayered Fault Detection and Localization With Transformer for Microservice Systems

J Wang, Y Li, Q Qi, Y Lu, B Wu - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Software architecture is undergoing a transition from monolithic architecture to microservices
to achieve resilience, agility, and scalability in the software life cycle. The complex …

Root Cause Analysis of Outliers with Missing Structural Knowledge

N Okati, SHG Mejia, WR Orchard, P Blöbaum… - stat, 2024 - arxiv.org
Recent work conceptualized root cause analysis (RCA) of anomalies via quantitative
contribution analysis using causal counterfactuals in structural causal models (SCMs). The …

Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer

S Agarwal, S Chakraborty, S Garg, S Bisht… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud services are omnipresent and critical cloud service failure is a fact of life. In order to
retain customers and prevent revenue loss, it is important to provide high reliability …

MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications

Y Tsubouchi, H Tsuruta - IEEE Access, 2024 - ieeexplore.ieee.org
Automated fault localization in large-scale cloud-based applications is challenging because
it involves mining multivariate time series data from large volumes of operational monitoring …

Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

X Zhang, S Ghosh, C Bansal, R Wang, M Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud
services, requiring on-call engineers to identify the primary issues and implement corrective …

Hi-RCA: A Hierarchy Anomaly Diagnosis Framework Based on Causality and Correlation Analysis

J Yang, Y Guo, Y Chen, Y Zhao - Applied Sciences, 2023 - mdpi.com
Microservice architecture has been widely adopted by large-scale applications. Due to the
huge amount of data and complex microservice dependency, it also poses new challenges …

The PetShop Dataset--Finding Causes of Performance Issues across Microservices

M Hardt, W Orchard, P Blöbaum… - arXiv preprint arXiv …, 2023 - arxiv.org
Identifying root causes for unexpected or undesirable behavior in complex systems is a
prevalent challenge. This issue becomes especially crucial in modern cloud applications …