Fault tolerance in an inner-outer solver: a GVR-enabled case study

Z Zheng, AA Chien, K Teranishi - … Conference, Eugene, OR, USA, June 30 …, 2015 - Springer
Resilience is a major challenge for large-scale systems. It is particularly important for
iterative linear solvers, since they take much of the time of many scientific applications. We …

Fault-tolerant linear solvers via selective reliability

PG Bridges, KB Ferreira, MA Heroux… - arXiv preprint arXiv …, 2012 - arxiv.org
Energy increasingly constrains modern computer hardware, yet protecting computations and
data against errors costs energy. This holds at all scales, but especially for the largest …

[BOOK][B] Resilient Iterative Linear Solvers Running Through Errors

JJ Elliott III - 2015 - search.proquest.com
Future extreme-scale computer systems may expose incorrect behavior to applications, in
order to save energy or increase performance. However, resilience research struggles to …

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

A numerical soft fault model for iterative linear solvers

J Elliott, M Hoemmen, F Mueller - Proceedings of the 24th International …, 2015 - dl.acm.org
We present a fault model designed to bring out the "worst" in iterative solvers based on
mathematical properties. Our model introduces substantially higher overhead, but smaller …

Asynchronous and exact forward recovery for detected errors in iterative solvers

L Jaulmes, M Moreto, E Ayguade… - … on Parallel and …, 2018 - ieeexplore.ieee.org
Current trends and projections show that faults in computer systems become increasingly
common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …

Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead

H Sun, A Gainaru, M Shantharam… - 2020 IEEE 32nd …, 2020 - ieeexplore.ieee.org
The increasing scale and complexity of today's high-performance computing (HPC) systems
demand a renewed focus on enhancing the resilience of long-running scientific applications …

[PDF] Error checking and snapshot-based recovery in a preconditioned conjugate gradient solver

Z Rubenstein, H Fujita, Z Zheng… - Technical Report TR …, 2013 - newtraell.cs.uchicago.edu
Soft errors are a significant concern for high-performance computing systems in the exascale
time frame. We apply our group's Global View Resilience (GVR) library to a preconditioned …

Towards resilient parallel linear Krylov solvers: recover-restart strategies

E Agullo, L Giraud, A Guermouche, J Roman… - 2013 - inria.hal.science
The advent of extreme scale machines will require the use of parallel resources at an
unprecedented scale, probably leading to a high rate of hardware faults. High Performance …

Hard faults and soft-errors: possible numerical remedies in linear algebra solvers

E Agullo, S Cools, L Giraud, A Moreau, P Salas… - … Science–VECPAR 2016 …, 2017 - Springer
On future large-scale systems, the mean time between failures (MTBF) of the system is
expected to decrease so that many faults could occur during the solution of large problems …