A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

New-Sum: A novel online ABFT scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …

Anatomy of high-performance GEMM with online fault tolerance on GPUs

S Wu, Y Zhai, J Liu, J Huang, Z Jian, B Wong… - Proceedings of the 37th …, 2023 - dl.acm.org
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as
machine learning and scientific computing since an efficient GEMM implementation is …

FT-BLAS: A high performance BLAS implementation with online fault tolerance

Y Zhai, E Giem, Q Fan, K Zhao, J Liu… - Proceedings of the ACM …, 2021 - dl.acm.org
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and
machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines …

Checkpointing workflows for fail-stop errors

L Han, LC Canon, H Casanova… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
We consider the problem of orchestrating the execution of workflow applications structured
as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail …

Optimal resilience patterns to cope with fail-stop and silent errors

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop
errors. Many others deal with silent errors (or silent data corruptions). But very few papers …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier
This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

A generic approach to scheduling and checkpointing workflows

L Han, V Le Fèvre, LC Canon, Y Robert… - Proceedings of the 47th …, 2018 - dl.acm.org
This work deals with scheduling and checkpointing strategies to execute scientific workflows
on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to …