A survey of machine learning for computer architecture and systems

N Wu, Y Xie - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
It has been a long time that computer architecture and systems are optimized for efficient
execution of machine learning (ML) models. Now, it is time to reconsider the relationship …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

Disk failure prediction in data centers via online learning

J Xiao, Z Xiong, S Wu, Y Yi, H Jin, K Hu - Proceedings of the 47th …, 2018 - dl.acm.org
Disk failure has become a major concern with the rapid expansion of storage systems in
data centers. Based on SMART (Self-Monitoring, Analysis and Reporting Technology) …

Improving service availability of cloud systems by predicting disk error

Y Xu, K Sui, R Yao, H Zhang, Q Lin, Y Dang… - 2018 USENIX Annual …, 2018 - usenix.org
High service availability is crucial for cloud systems. A typical cloud system uses a large
number of physical hard disk drives. Disk errors are one of the most important reasons that …

Lessons and actions: What we learned from 10k SSD-Related storage system failures

E Xu, M Zheng, F Qin, Y Xu, J Wu - 2019 USENIX Annual Technical …, 2019 - usenix.org
Modern datacenters increasingly use flash-based solid state drives (SSDs) for high
performance and low energy cost. However, SSD introduces more complex failure modes …

Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity

S Kadekodi, KV Rashmi, GR Ganger - 17th USENIX Conference on File …, 2019 - usenix.org
Large-scale cluster storage systems typically consist of a heterogeneous mix of storage
devices with significantly varying failure rates. Despite such differences among devices …

Tiger: Disk-Adaptive redundancy without placement restrictions

S Kadekodi, F Maturana, S Athlur, A Merchant… - … USENIX Symposium on …, 2022 - usenix.org
Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data
durability. Disk-adaptive redundancy—dynamically tailoring the redundancy scheme to …

Multi-view feature-based SSD failure prediction: What, when, and why

Y Zhang, W Hao, B Niu, K Liu, S Wang, N Liu… - … USENIX Conference on …, 2023 - usenix.org
Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures
affect the stability of storage systems and cause additional maintenance overhead. To …

An empirical study of the impact of data splitting decisions on the performance of AIOps solutions

Y Lyu, H Li, M Sayagh, ZM Jiang… - ACM Transactions on …, 2021 - dl.acm.org
AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help
practitioners handle the massive data produced during the operations of large-scale …