Resilience-aware resource management for exascale computing systems

D Dauwe, S Pasricha, AA Maciejewski… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
With the increases in complexity and number of nodes in large-scale high performance
computing (HPC) systems over time, the probability of applications experiencing runtime …

An analysis of resilience techniques for exascale computing platforms

D Dauwe, S Pasricha, AA Maciejewski… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
With the increase in the complexity and number of nodes in large-scale high performance
computing (HPC) systems, the probability of applications experiencing failures has …

A performance and energy comparison of fault tolerance techniques for exascale computing systems

D Dauwe, S Pasricha, AA Maciejewski… - … on Computer and …, 2016 - ieeexplore.ieee.org
As the computing power of large scale computing systems increases exponentially with time,
their failure rates are increasing exponentially as well. While current high performance …

Simulating application resilience at exascale

R Riesen, KB Ferreira, MR Varela, M Taufer… - Euro-Par 2011: Parallel …, 2012 - Springer
The reliability mechanisms for future exascale systems will be a key aspect of their
scalability and performance. With the expected jump in hardware component counts, faults …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Towards new metrics for high-performance computing resilience

S Hukerikar, RA Ashraf, C Engelmann - … of the 2017 Workshop on Fault …, 2017 - dl.acm.org
Ensuring the reliability of applications is becoming an increasingly important challenge as
high-performance computing (HPC) systems experience an ever-growing number of faults …

Asking the right questions: benchmarking fault-tolerant extreme-scale systems

PM Widener, KB Ferreira, S Levy, PG Bridges… - Euro-Par 2013: Parallel …, 2014 - Springer
Much recent research has explored fault-tolerance mechanisms intended for current and
future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions …

Using simulation to evaluate the performance of resilience strategies at scale

S Levy, B Topp, KB Ferreira, D Arnold, T Hoefler… - … and Simulation: 4th …, 2014 - Springer
Fault-tolerance has been identified as a major challenge for future extreme-scale systems.
Current predictions suggest that, as systems grow in size, failures will occur more frequently …