A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org
Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Checkpointing à la Young/Daly: an overview

A Benoit, Y Du, T Herault, L Marchal, G Pallez… - Proceedings of the …, 2022 - dl.acm.org
The Young/Daly formula provides an approximation of the optimal checkpoint period for a
parallel application executing on a supercomputing platform. The Young/Daly formula was …

Improving checkpointing intervals by considering individual job failure probabilities

A Frank, M Baumgartner, R Salkhordeh… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Checkpointing is a popular resilience method in HPC and its efficiency highly depends on
the choice of the checkpoint interval. Standard analytical approaches optimize intervals for …

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

A Benoit, L Perotin, Y Robert, F Vivien - ACM Transactions on Parallel …, 2024 - dl.acm.org
This article studies checkpointing strategies for parallel applications subject to failures. The
optimal strategy to minimize total execution time, or makespan, is well known when failure …

Reducing resource waste in HPC through co-allocation, custom checkpoints, and lower false failure prediction rates

A Frank - 2022 - openscience.ub.uni-mainz.de
Bigger systems are being deployed by High Performance Computing centers in order to
fulfill the needs of modern scientific and big data applications as well as to match the …

Online fault tolerant task scheduling for real-time multiprocessor embedded systems

P Dobiáš - 2020 - hal.science
The thesis is concerned with online mapping and scheduling of tasks on multiprocessor
embedded systems in order to improve the reliability subject to various constraints regarding …

Checkpointing strategies to protect parallel jobs from non-memoryless fail-stop errors

A Benoit, L Perotin, Y Robert, F Vivien - 2022 - inria.hal.science
This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The
optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it …

Scheduling algorithms to optimize the performance, energy consumption and robustness of HPC applications

L Perotin - 2023 - theses.hal.science
This thesis addresses the problem of resilience in large-scale computer systems. Due to the
rapid development of high-performance computing technology, it has become crucial to …