Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …
Production incidents in today's large-scale cloud services can be extremely expensive in terms of customer impacts and engineering resources required to mitigate them. Despite …
Recently, Bronson et al. introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures …
Cloud services have become the backbone of today's computing world. Runtime incidents, which adversely affect the expected service operations, are extremely costly in terms of user …
The workload estimation plays a vital role in efficient management of cloud resources. This paper introduces the error preventive score (EPS) in time series forecasting models to …
R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org
Testing a scalability bottleneck requires a large system to generate sufficient load, which is usually not accessible to researchers. To address this problem, this paper extrapolates the …
Z Cao, S Liu, F Wu, G Wang, B Li, DHC Du - 17th USENIX Conference …, 2019 - usenix.org
Data deduplication is an effective way of improving storage space utilization. The data generated by deduplication is persistently stored in data chunks or data containers (a …
The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To conduct root cause …
RO Suminto, CA Stuardo, A Clark, H Ke… - Proceedings of the …, 2017 - dl.acm.org
We reveal loopholes of Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data …