A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Checkpointing workflows for fail-stop errors

L Han, LC Canon, H Casanova… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
We consider the problem of orchestrating the execution of workflow applications structured
as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail …

A method to represent multiple-output switching functions by using multi-valued decision diagrams

T Sasao, JT Butler - … of 26th IEEE International Symposium on …, 1996 - ieeexplore.ieee.org
Multiple-output switching functions can be simulated by multiple-valued decision diagrams
(MDDs) at a significant reduction in computation time. We analyze the following approaches to …

Optimal resilience patterns to cope with fail-stop and silent errors

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop
errors. Many others deal with silent errors (or silent data corruptions). But very few papers …

Exploiting spatial smoothness in HPC applications to detect silent data corruption

L Bautista-Gomez, F Cappello - 2015 IEEE 17th International …, 2015 - ieeexplore.ieee.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. This situation is pushing …

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier
This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

A generic approach to scheduling and checkpointing workflows

L Han, V Le Fèvre, LC Canon, Y Robert… - Proceedings of the 47th …, 2018 - dl.acm.org
This work deals with scheduling and checkpointing strategies to execute scientific workflows
on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to …

Design and comparison of resilient scheduling heuristics for parallel jobs

A Benoit, V Le Fèvre, P Raghavan… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
This paper focuses on the resilient scheduling of parallel jobs on high-performance
computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit …