Checkpointing workflows for fail-stop errors

L Han, LC Canon, H Casanova… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
We consider the problem of orchestrating the execution of workflow applications structured
as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail …

Multi-level checkpointing and silent error detection for linear workflows

A Benoit, A Cavelan, Y Robert, H Sun - Journal of computational science, 2018 - Elsevier
We focus on High Performance Computing (HPC) workflows whose dependency
graph forms a linear chain, and we extend single-level checkpointing in two important …

A generic approach to scheduling and checkpointing workflows

L Han, V Le Fèvre, LC Canon, Y Robert… - Proceedings of the 47th …, 2018 - dl.acm.org
This work deals with scheduling and checkpointing strategies to execute scientific workflows
on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to …

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors

A Benoit, A Cavelan, FM Ciorba, V Le Fèvre… - International Journal of …, 2019 - jstage.jst.go.jp
Large-scale platforms currently experience errors from two different sources, namely
fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and …

Checkpointing strategies for scheduling computational workflows

G Aupy, A Benoit, H Casanova… - International Journal of …, 2016 - jstage.jst.go.jp
We study the scheduling of computational workflows on compute resources that experience
exponentially distributed failures. When a failure occurs, rollback and recovery is used to …

Efficient checkpoint/verification patterns

A Benoit, SK Raina, Y Robert - The International Journal of …, 2017 - journals.sagepub.com
Errors have become a critical problem for high-performance computing. Checkpointing
protocols are often used for error recovery after fail-stop failures. However, silent errors …

Space-efficient page-level incremental checkpointing

J Heo, S Yi, Y Cho, J Hong, SY Shin - … of the 2005 ACM symposium on …, 2005 - dl.acm.org
Incremental checkpointing, which is intended to minimize checkpointing overhead, saves
only the modified pages of a process. However, the cumulative size of incremental …

Optimal checkpointing strategies for iterative applications

Y Du, L Marchal, G Pallez… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
This work provides an optimal checkpointing strategy to protect iterative applications from
fail-stop errors. We consider a general framework, where the application repeats the same …

Adaptive page-level incremental checkpointing based on expected recovery time

S Yi, J Heo, Y Cho, J Hong - Proceedings of the 2006 ACM symposium …, 2006 - dl.acm.org
Incremental checkpointing, which is intended to minimize checkpointing overhead, saves
only the modified pages of a process. This means that in incremental checkpointing, the time …