How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Detection is better than cure: A cloud incidents perspective

V Ganatra, A Parayil, S Ghosh, Y Kang, M Ma… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud providers use automated watchdogs or monitors to continuously observe service
availability and to proactively report incidents when system performance degrades. Improper …

What bugs cause production cloud incidents?

H Liu, S Lu, M Musuvathi, S Nath - Proceedings of the Workshop on Hot …, 2019 - dl.acm.org
Cloud services have become the backbone of today's computing world. Runtime incidents,
which adversely affect the expected service operations, are extremely costly in terms of user …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

Cloud incident data: An empirical analysis

L Fiondella, SS Gokhale… - 2013 IEEE International …, 2013 - ieeexplore.ieee.org
This paper presents an empirical analysis of cloud incidents reported in the Cloutage.org
database. The trend, causes, and impact of three types of incidents, namely, Outage …

Mining root cause knowledge from cloud service incident investigations for AIOps

A Saha, SCH Hoi - Proceedings of the 44th International Conference on …, 2022 - dl.acm.org
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as
well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce …

Efficient customer incident triage via linking with system incidents

J Gu, J Wen, Z Wang, P Zhao, C Luo, Y Kang… - Proceedings of the 28th …, 2020 - dl.acm.org
In cloud service systems, customers will report the service issues they have encountered to
cloud service providers. Despite many issues can be handled by the support team …

An intelligent framework for timely, accurate, and comprehensive cloud incident detection

Y Li, X Zhang, S He, Z Chen, Y Kang, J Liu… - ACM SIGOPS …, 2022 - dl.acm.org
Cloud incidents (service interruptions or performance degradation) dramatically degrade the
reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss …

Fighting the fog of war: Automated incident detection for cloud systems

L Li, X Zhang, X Zhao, H Zhang, Y Kang… - 2021 USENIX Annual …, 2021 - usenix.org
Incidents and outages dramatically degrade the availability of large-scale cloud computing
systems such as AWS, Azure, and GCP. In current incident response practice, each team …

Efficient incident identification from multi-dimensional issue reports via meta-heuristic search

J Gu, C Luo, S Qin, B Qiao, Q Lin, H Zhang… - Proceedings of the 28th …, 2020 - dl.acm.org
In large-scale cloud systems, unplanned service interruptions and outages may cause
severe degradation of service availability. Such incidents can occur in a bursty manner …