Scalability bugs: When 100-node testing is not enough

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io

Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Cited by 24 · Related articles

[PDF] acm.org

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org

Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Cited by 51 · Related articles · All 6 versions

[PDF] github.io

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org

Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Cited by 38 · Related articles · All 3 versions

[PDF] usenix.org

Metastable failures in the wild

L Huang, M Magnusson, AB Muralikrishna… - … USENIX Symposium on …, 2022 - usenix.org

Recently, Bronson et al. introduced a framework for understanding a class of failures in
distributed systems called metastable failures. The examples of metastable failures …

Cited by 28 · Related articles · All 3 versions

[PDF] uchicago.edu

What bugs cause production cloud incidents?

H Liu, S Lu, M Musuvathi, S Nath - Proceedings of the Workshop on Hot …, 2019 - dl.acm.org

Cloud services have become the backbone of today's computing world. Runtime incidents,
which adversely affect the expected service operations, are extremely costly in terms of user …

Cited by 83 · Related articles · All 3 versions

Cloud datacenter workload estimation using error preventive time series forecasting models

J Kumar, AK Singh - Cluster Computing, 2020 - Springer

The workload estimation plays a vital role in efficient management of cloud resources. This
paper introduces the error preventive score (EPS) in time series forecasting models to …

Cited by 45 · Related articles · All 3 versions

[PDF] nsf.gov

Evaluating scalability bottlenecks by workload extrapolation

R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org

Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …

Cited by 44 · Related articles · All 4 versions

[PDF] usenix.org

Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance

Z Cao, S Liu, F Wu, G Wang, B Li, DHC Du - 17th USENIX Conference …, 2019 - usenix.org

Data deduplication is an effective way of improving storage space utilization. The data
generated by deduplication is persistently stored in data chunks or data containers (a …

Cited by 47 · Related articles · All 10 versions

[PDF] arxiv.org

mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture

W Zhang, H Guo, J Yang, Y Zhang, C Yan… - arXiv preprint arXiv …, 2024 - arxiv.org

The escalating complexity of micro-services architecture in cloud-native technologies poses
significant challenges for maintaining system stability and efficiency. To conduct root cause …

Cited by 2 · Related articles · All 2 versions

[PDF] acm.org

PBSE: A robust path-based speculative execution for degraded-network tail tolerance in data-parallel frameworks

RO Suminto, CA Stuardo, A Clark, H Ke… - Proceedings of the …, 2017 - dl.acm.org

We reveal loopholes of Speculative Execution (SE) implementations under a unique fault
model: node-level network throughput degradation. This problem appears in many data …

Cited by 28 · Related articles · All 4 versions
