Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

An analysis of resilience techniques for exascale computing platforms

D Dauwe, S Pasricha, AA Maciejewski… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
With the increase in the complexity and number of nodes in large-scale high performance
computing (HPC) systems, the probability of applications experiencing failures has …

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org
Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Reducing waste in extreme scale systems through introspective analysis

L Bautista-Gomez, A Gainaru… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
Resilience is an important challenge for extreme-scale supercomputers. Today, failures in
supercomputers are assumed to be uniformly distributed in time. However, recent studies …

HPC hardware design reliability benchmarking with HDFIT

P Omland, A Netti, Y Peng, A Baldovin… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become
more concerning, particularly at the scale of High-Performance Computing (HPC) systems …

Codesign challenges for exascale systems: Performance, power, and reliability

D Kerbyson, A Vishnu, K Barker, A Hoisie - Computer, 2011 - ieeexplore.ieee.org
The complexity of large-scale parallel systems necessitates the simultaneous optimization of
multiple hardware and software components to meet performance, efficiency, and fault …

Reading between the lines of failure logs: Understanding how HPC systems fail

N El-Sayed, B Schroeder - 2013 43rd annual IEEE/IFIP …, 2013 - ieeexplore.ieee.org
As the component count in supercomputing installations continues to increase, system
reliability is becoming one of the major issues in designing HPC systems. These issues will …