Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Infrastructure and API extensions for elastic execution of MPI applications

I Comprés, A Mo-Hellenbrand, M Gerndt… - Proceedings of the 23rd …, 2016 - dl.acm.org
Dynamic Processes support was added to MPI in version 2.0 of the standard. This feature of
MPI has not been widely used by application developers in part due to the performance cost …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

S Chakraborty, I Laguna, M Emani… - Concurrency and …, 2020 - Wiley Online Library
Scientists from many different fields have been developing Bulk‐Synchronous MPI
applications to simulate and study a wide variety of scientific phenomena. Since failure rates …

System-level scalable checkpoint-restart for petascale computing

J Cao, K Arya, R Garg, S Matott… - 2016 IEEE 22nd …, 2016 - ieeexplore.ieee.org
Fault tolerance for the upcoming exascale generation has long been an area of active
research. One of the components of a fault tolerance strategy is checkpointing. Petascale …

Failure recovery for bulk synchronous applications with MPI stages

N Sultana, M Rüfenacht, A Skjellum, I Laguna… - Parallel Computing, 2019 - Elsevier
When an MPI program experiences a failure, the most common recovery approach is to
restart all processes from a previous checkpoint and to re-queue the entire job. A …

DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

C Santana, RCF Araújo, IM Sardina, ÍAS Assis… - Computers & …, 2024 - Elsevier
Many geophysical imaging applications, such as full-waveform inversion, often rely on
high-performance computing to meet their demanding computational requirements. The failure of …

MPI Sessions: Evaluation of an implementation in Open MPI

N Hjelm, H Pritchard, SK Gutiérrez… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
The recently proposed MPI Sessions extensions to the MPI standard present a new
paradigm for applications to use with MPI. MPI Sessions has the potential to address several …