LetGo: A lightweight continuous framework for HPC applications under failures

B Fang, Q Guan, N Debardeleben… - Proceedings of the 26th …, 2017 - dl.acm.org
Requirements for reliability, low power consumption, and performance place complex and
conflicting demands on the design of high-performance computing (HPC) systems. Fault …

Assessing energy efficiency of fault tolerance protocols for HPC systems

E Meneses, O Sarood, LV Kalé - 2012 IEEE 24th International …, 2012 - ieeexplore.ieee.org
An exascale machine is expected to be delivered in the time frame 2018-2020. Such a
machine will be able to tackle some of the hardest computational problems and to extend …

Color: Co-located rescuers for fault tolerance in HPC systems

Z Hussain, X Cui, T Znati… - 2018 IEEE 24th …, 2018 - ieeexplore.ieee.org
With the increase in scale of HPC systems, the frequency of system wide failures is expected
to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …

SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing

T Ropars, TV Martsinkevich, A Guermouche… - Proceedings of the …, 2013 - dl.acm.org
The high failure rate expected for future supercomputers requires the design of new fault
tolerant solutions. Most checkpointing protocols are designed to work with any message …

Asynchronous checkpointing by dedicated checkpoint threads

F Shahzad, M Wittmann, T Zeiser, G Wellein - Recent Advances in the …, 2012 - Springer
Checkpoint/restart (C/R) is a classical approach to introduce fault tolerance in large
HPC applications. Although it is relatively easy as compared to other fault tolerance …

Multi-criteria checkpointing strategies: Response-time versus resource utilization

A Bouteiller, F Cappello, J Dongarra… - Euro-Par 2013 Parallel …, 2013 - Springer
Failures are increasingly threatening the efficiency of HPC systems, and current projections
of Exascale platforms indicate that rollback recovery, the most convenient method for …

Checkpoint/restart in practice: When 'simple is better'

N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org
Efficient use of high-performance computing (HPC) installations critically relies on effective
methods for fault tolerance. The most commonly used method is checkpoint/restart, where …

CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance

F Shahzad, J Thies, M Kreutzer, T Zeiser… - … on Parallel and …, 2018 - ieeexplore.ieee.org
In order to efficiently use the future generations of supercomputers, fault tolerance and
power consumption are two of the prime challenges anticipated by the High Performance …

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …