Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Towards optimal multi-level checkpointing

A Benoit, A Cavelan, V Le Fèvre… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
We provide a framework to analyze multi-level checkpointing protocols, by formally defining
a k-level checkpointing pattern. We provide a first-order approximation to the optimal …

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier
This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

Multi-level checkpointing and silent error detection for linear workflows

A Benoit, A Cavelan, Y Robert, H Sun - Journal of Computational Science, 2018 - Elsevier
We focus on High Performance Computing (HPC) workflows whose dependency
graph forms a linear chain, and we extend single-level checkpointing in two important …

Coping with recall and precision of soft error detectors

L Bautista-Gomez, A Benoit, A Cavelan… - Journal of Parallel and …, 2016 - Elsevier
Many methods are available to detect silent errors in high-performance computing (HPC)
applications. Each method comes with a cost, a recall (fraction of all errors that are actually …

Addressing data resiliency for staging based scientific workflows

S Duan, P Subedi, PE Davis, M Parashar - Proceedings of the …, 2019 - dl.acm.org
As applications move towards extreme scales, data-related challenges are becoming
significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data …

Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors

A Benoit, A Cavelan, FM Ciorba, V Le Fèvre… - International Journal of …, 2019 - jstage.jst.go.jp
Large-scale platforms currently experience errors from two different sources, namely
fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and …

Coalescing and deduplicating incremental checkpoint files for restore-express multi-level checkpointing

P Sigdel, NF Tzeng - IEEE Transactions on Parallel and …, 2018 - ieeexplore.ieee.org
In multicore systems, a large portion of checkpoint time overhead can be hidden from the
execution critical path by resorting to a dedicated checkpointing thread run concurrently with …

Combining checkpointing and replication for reliable execution of linear workflows

A Benoit, A Cavelan, FM Ciorba… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
This paper combines checkpointing and replication for the reliable execution of linear
workflows. While both methods have been studied separately, their combination has not yet …

Coping with silent errors in HPC applications

G Aupy, A Benoit, A Cavelan, M Fasi, Y Robert… - … : A Festschrift for Selim G …, 2017 - Springer
This chapter describes a unified framework for the detection and correction of silent errors,
which constitute a major threat for scientific applications at extreme-scale. We first motivate …