Proactive error prediction to improve storage system reliability

N Wu, Y Xie - ACM Computing Surveys (CSUR), 2022 - dl.acm.org

It has been a long time that computer architecture and systems are optimized for efficient
execution of machine learning (ML) models. Now, it is time to reconsider the relationship …

Cited by 66 - Related articles - All 4 versions

[PDF] github.io

A survey of aiops methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org

Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Cited by 59 - Related articles - All 3 versions

[PDF] usenix.org

Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org

Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

Cited by 119 - Related articles - All 14 versions

[PDF] google.com

Disk failure prediction in data centers via online learning

J Xiao, Z Xiong, S Wu, Y Yi, H Jin, K Hu - Proceedings of the 47th …, 2018 - dl.acm.org

Disk failure has become a major concern with the rapid expansion of storage systems in
data centers. Based on SMART (Self-Monitoring, Analysis and Reporting Technology) …

Cited by 114 - Related articles - All 2 versions

[PDF] usenix.org

Improving service availability of cloud systems by predicting disk error

Y Xu, K Sui, R Yao, H Zhang, Q Lin, Y Dang… - 2018 USENIX Annual …, 2018 - usenix.org

High service availability is crucial for cloud systems. A typical cloud system uses a large
number of physical hard disk drives. Disk errors are one of the most important reasons that …

Cited by 141 - Related articles - All 10 versions

[PDF] usenix.org

Lessons and actions: What we learned from 10k SSD-Related storage system failures

E Xu, M Zheng, F Qin, Y Xu, J Wu - 2019 USENIX Annual Technical …, 2019 - usenix.org

Modern datacenters increasingly use flash-based solid state drives (SSDs) for high
performance and low energy cost. However, SSD introduces more complex failure modes …

Cited by 71 - Related articles - All 9 versions

[PDF] usenix.org

Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity

S Kadekodi, KV Rashmi, GR Ganger - 17th USENIX Conference on File …, 2019 - usenix.org

Large-scale cluster storage systems typically consist of a heterogeneous mix of storage
devices with significantly varying failure rates. Despite such differences among devices …

Cited by 62 - Related articles - All 8 versions

[PDF] usenix.org

Tiger: Disk-Adaptive redundancy without placement restrictions

S Kadekodi, F Maturana, S Athlur, A Merchant… - … USENIX Symposium on …, 2022 - usenix.org

Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data
durability. Disk-adaptive redundancy—dynamically tailoring the redundancy scheme to …

Cited by 15 - Related articles - All 9 versions

[PDF] usenix.org

Multi-view feature-based SSD failure prediction: What, when, and why

Y Zhang, W Hao, B Niu, K Liu, S Wang, N Liu… - … USENIX Conference on …, 2023 - usenix.org

Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures
affect the stability of storage systems and cause additional maintenance overhead. To …

Cited by 8 - Related articles - All 5 versions

[PDF] hengli.org

An empirical study of the impact of data splitting decisions on the performance of AIOps solutions

Y Lyu, H Li, M Sayagh, ZM Jiang… - ACM Transactions on …, 2021 - dl.acm.org

AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help
practitioners handle the massive data produced during the operations of large-scale …

Cited by 29 - Related articles - All 8 versions
