teaMPI—replication-based resilience without the (performance) pain

P Samfass, T Weinzierl, B Hazelwood… - … Conference, ISC High …, 2020 - Springer
In an era where we can not afford to checkpoint frequently, replication is a generic way
forward to construct numerical simulations that can continue to run even if hardware parts …

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Performance efficient multiresilience using checkpoint recovery in iterative algorithms

RA Ashraf, C Engelmann - Euro-Par 2018: Parallel Processing Workshops …, 2019 - Springer
In this paper, we address the design challenge of building multiresilient iterative high-
performance computing (HPC) applications. Multiresilience in HPC applications is the ability …

Using replication for resilience on exascale systems

H Casanova, F Vivien, D Zaidouni - Fault-Tolerance Techniques for High …, 2015 - Springer
High-performance computing applications must be resilient to faults. The traditional fault
tolerance solution is checkpoint–recovery, by which application state is saved to and …

Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead

H Sun, A Gainaru, M Shantharam… - 2020 IEEE 32nd …, 2020 - ieeexplore.ieee.org
The increasing scale and complexity of today's high-performance computing (HPC) systems
demand a renewed focus on enhancing the resilience of long-running scientific applications …

Keeping checkpoint/restart viable for exascale systems

K Ferreira - 2011 - digitalrepository.unm.edu
Next-generation exascale systems, those capable of performing a quintillion operations per
second, are expected to be delivered in the next 8-10 years. These systems, which will be …

Power-aware resilience for exascale computing

B Mills - 2014 - search.proquest.com
To enable future scientific breakthroughs and discoveries, the next generation of scientific
applications will require exascale computing performance to support the execution of …

Improving scalability of silent-error resilience for message-passing solvers via local recovery and asynchrony

H Kolla, JR Mayo, K Teranishi… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Benefits of local recovery (restarting only a failed process or task) have been previously
demonstrated in parallel solvers. Local recovery has a reduced impact on application …

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …