ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …

Outage prediction and diagnosis for cloud service systems

Y Chen, X Yang, Q Lin, H Zhang, F Gao, Z Xu… - The world wide web …, 2019 - dl.acm.org
With the rapid growth of cloud service systems and their increasing complexity, service
failures become unavoidable. Outages, which are critical service failures, could dramatically …

Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer

S Agarwal, S Chakraborty, S Garg, S Bisht… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud services are omnipresent and critical cloud service failure is a fact of life. In order to
retain customers and prevent revenue loss, it is important to provide high reliability …

Causal modeling based fault localization in cloud systems using golden signals

P Aggarwal, S Nagar, A Gupta… - 2021 IEEE 14th …, 2021 - ieeexplore.ieee.org
In cloud-native applications, a large fraction of operational failures, known as outages, result
in violations of Service Level Objectives (SLOs). SLOs are defined around specific …

AID: efficient prediction of aggregated intensity of dependency in large-scale cloud systems

T Yang, J Shen, Y Su, X Ling, Y Yang… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
Service reliability is one of the key challenges that cloud providers have to deal with. In
cloud systems, unplanned service failures may cause severe cascading impacts on their …

ViSRE: A unified visual analysis dashboard for proactive cloud outage management

P Kayongo, J Hoffswell, S Saini, S Garg… - 2022 Working …, 2022 - ieeexplore.ieee.org
Efficient outage detection and remediation is crucial for effectively operating cloud
computing systems. To remediate outages, system engineers must quickly identify the …

ServerRCA: Root Cause Analysis for Server Failure using Operating System Logs

J Shi, S Jiang, B Xu, Y Xiao - 2023 IEEE 34th International …, 2023 - ieeexplore.ieee.org
The development of the information technology industry has made servers an essential
infrastructure for enterprises. Server failure may result in significant economic losses …

Mining root cause knowledge from cloud service incident investigations for AIOps

A Saha, SCH Hoi - Proceedings of the 44th International Conference on …, 2022 - dl.acm.org
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as
well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Predicting breakdowns in cloud services (with SPIKE)

J Chen, J Chakraborty, P Clark, K Haverlock… - Proceedings of the …, 2019 - dl.acm.org
Maintaining web-services is a mission-critical task where any down-time means loss of
revenue and reputation (of being a reliable service provider). In the current competitive web …