Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods

Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org
Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …

Fault tolerance in an inner-outer solver: a GVR-enabled case study

Z Zheng, AA Chien, K Teranishi - … Conference, Eugene, OR, USA, June 30 …, 2015 - Springer
Resilience is a major challenge for large-scale systems. It is particularly important for
iterative linear solvers, since they take much of the time of many scientific applications. We …

Towards resilient parallel linear Krylov solvers: recover-restart strategies

E Agullo, L Giraud, A Guermouche, J Roman… - 2013 - inria.hal.science
The advent of extreme scale machines will require the use of parallel resources at an
unprecedented scale, probably leading to a high rate of hardware faults. High Performance …

A block recycled GMRES method with investigations into aspects of solver performance

ML Parks, KM Soodhalter, DB Szyld - arXiv preprint arXiv:1604.01713, 2016 - arxiv.org
We propose a block Krylov subspace version of the GCRO-DR method proposed in [Parks et
al. SISC 2005], which is an iterative method allowing for the efficient minimization of the …

Numerical recovery strategies for parallel resilient Krylov linear solvers

E Agullo, L Giraud, A Guermouche… - … Linear Algebra with …, 2016 - Wiley Online Library
As the computational power of high‐performance computing systems continues to increase
by using a huge number of cores or specialized processing units, high‐performance …

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

AJ Peña, W Bland, P Balaji - … of the International Conference for High …, 2015 - dl.acm.org
Popular accelerator programming models rely on offloading computation operations and
their corresponding data transfers to the coprocessors, leveraging synchronization points …

teaMPI—replication-based resilience without the (performance) pain

P Samfass, T Weinzierl, B Hazelwood… - … Conference, ISC High …, 2020 - Springer
In an era where we cannot afford to checkpoint frequently, replication is a generic way
forward to construct numerical simulations that can continue to run even if hardware parts …

Local recovery and failure masking for stencil-based applications at extreme scales

M Gamell, K Teranishi, MA Heroux, J Mayo… - Proceedings of the …, 2015 - dl.acm.org
Application resilience is a key challenge that has to be addressed to realize the exascale
vision. Online recovery, even when it involves all processes, can dramatically reduce the …

An error-resilient redundant subspace correction method

T Cui, J Xu, CS Zhang - Computing and Visualization in Science, 2017 - Springer
Due to increasing complexity of supercomputers, hard and soft errors are causing more and
more problems in high-performance scientific and engineering computation. In order to …