What supercomputers say: A study of five system logs

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org
The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Exploring event correlation for failure prediction in coalitions of clusters

S Fu, CZ Xu - Proceedings of the 2007 ACM/IEEE conference on …, 2007 - dl.acm.org
In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …

Log‐based anomaly detection for distributed systems: State of the art, industry experience, and open issues

X Wei, J Wang, C Sun, D Towey… - Journal of Software …, 2024 - Wiley Online Library
Distributed systems have been widely used in many safety‐critical areas. Any abnormalities
(e.g., service interruption or service quality degradation) could lead to application crashes or …

A large-scale study of soft-errors on GPUs in the field

B Nie, D Tiwari, S Gupta, E Smirni… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility

D Tiwari, S Gupta, G Gallarno, J Rogers… - Proceedings of the …, 2015 - dl.acm.org
The high computational capability of graphics processing units (GPUs) is enabling and
driving the scientific discovery process at large-scale. The world's second fastest …