A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Failure trends in a large disk drive population

E Pinheiro, WD Weber, LA Barroso - 2007 - usenix.org
It is estimated that over 90% of all new information produced in the world is being stored on
magnetic media, most of it on hard disk drives. Despite their importance, there is relatively …

Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

JF Murray, GF Hughes, K Kreutz-Delgado… - Journal of Machine …, 2005 - jmlr.org
We compare machine learning methods applied to a difficult real-world problem: predicting
computer hard-drive failure using attributes monitored internally by individual drives. The …

Health status assessment and failure prediction for hard drives with recurrent neural networks

C Xu, G Wang, X Liu, D Guo… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
Recently, in order to improve reactive fault tolerance techniques in large scale storage
systems, researchers have proposed various statistical and machine learning methods …

Hard drive failure prediction using classification and regression trees

J Li, X Ji, Y Jia, B Zhu, G Wang, Z Li… - 2014 44th Annual IEEE …, 2014 - ieeexplore.ieee.org
Some statistical and machine learning methods have been proposed to build hard drive
prediction models based on the SMART attributes, and have achieved good prediction …

Proactive drive failure prediction for large scale storage systems

B Zhu, G Wang, X Liu, D Hu, S Lin… - 2013 IEEE 29th …, 2013 - ieeexplore.ieee.org
Most of the modern hard disk drives support Self-Monitoring, Analysis and Reporting
Technology (SMART), which can monitor internal attributes of individual drives and predict …

RAIDShield: characterizing, monitoring, and proactively protecting against disk failures

A Ma, R Traylor, F Douglis, M Chamness, G Lu… - ACM Transactions on …, 2015 - dl.acm.org
Modern storage systems orchestrate a group of disks to achieve their performance and
reliability goals. Even though such systems are designed to withstand the failure of …