Y Chen, X Yang, Q Lin, H Zhang, F Gao, Z Xu… - The world wide web …, 2019 - dl.acm.org
With the rapid growth of cloud service systems and their increasing complexity, service failures become unavoidable. Outages, which are critical service failures, could dramatically …
Cloud services are omnipresent, and critical cloud service failures are a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability …
In cloud-native applications, a large fraction of operational failures, known as outages, result in violations of Service Level Objectives (SLOs). SLOs are defined around specific …
T Yang, J Shen, Y Su, X Ling, Y Yang… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
Service reliability is one of the key challenges that cloud providers have to deal with. In cloud systems, unplanned service failures may cause severe cascading impacts on their …
P Kayongo, J Hoffswell, S Saini, S Garg… - 2022 Working …, 2022 - ieeexplore.ieee.org
Efficient outage detection and remediation is crucial for effectively operating cloud computing systems. To remediate outages, system engineers must quickly identify the …
J Shi, S Jiang, B Xu, Y Xiao - 2023 IEEE 34th International …, 2023 - ieeexplore.ieee.org
The development of the information technology industry has made servers an essential infrastructure for enterprises. Server failure may result in significant economic losses …
A Saha, SCH Hoi - Proceedings of the 44th International Conference on …, 2022 - dl.acm.org
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce …
H Kumar, R Mahindru, D Kar - Proceedings of the 30th ACM Joint …, 2022 - dl.acm.org
For a cloud service provider, the goal is to proactively identify signals that can help reduce outages and/or reduce the mean-time-to-detect and mean-time-to-resolve. After an incident …
C Bansal, S Renganathan, A Asudani, O Midy… - Proceedings of the …, 2020 - dl.acm.org
Large-scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into the …