Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Fault-tolerant MPI

A Bouteiller - Fault-Tolerance Techniques for High-Performance …, 2015 - Springer
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI standard remains distressingly vague on the consequence of failures …

Checkpoint/restart approaches for a thread-based MPI runtime

J Adam, M Kermarquer, JB Besnard… - Parallel Computing, 2019 - Elsevier
Fault-tolerance has always been an important topic when it comes to running massively
parallel programs at scale. Statistically, hardware and software failures are expected to …

Resilient MPI applications using an application-level checkpointing framework and ULFM

N Losada, I Cores, MJ Martín, P González - The Journal of …, 2017 - Springer
Future exascale systems, formed by millions of cores, will present high failure rates, and
long-running applications will need to make use of new fault tolerance techniques to ensure …

Transparent high-speed network checkpoint/restart in MPI

J Adam, JB Besnard, AD Malony, S Shende… - Proceedings of the 25th …, 2018 - dl.acm.org
Fault-tolerance has always been an important topic when it comes to running massively
parallel programs at scale. Statistically, hardware and software failures are expected to …

A global exception fault tolerance model for MPI

I Laguna, T Gamblin, K Mohror, M Schulz, H Pritchard… - 2014 - osti.gov
Driven both by the anticipated hardware reliability constraints for exascale systems, and the
desire to use MPI in a broader application space, there is an ongoing effort to incorporate …

Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

W Bland, P Du, A Bouteiller, T Herault… - Concurrency and …, 2013 - Wiley Online Library
Most predictions of exascale machines picture billion ways parallelism, encompassing not
only millions of cores but also tens of thousands of nodes. Even considering extremely …

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Toward local failure local recovery resilience model using MPI-ULFM

K Teranishi, MA Heroux - Proceedings of the 21st European MPI Users' …, 2014 - dl.acm.org
The current system reaction to the loss of a single MPI process is to kill all the remaining
processes and restart the application from the most recent checkpoint. This approach will …