Fault-tolerant linear solvers via selective reliability

PG Bridges, KB Ferreira, MA Heroux… - arXiv preprint arXiv …, 2012 - arxiv.org
Energy increasingly constrains modern computer hardware, yet protecting computations and
data against errors costs energy. This holds at all scales, but especially for the largest …

A numerical soft fault model for iterative linear solvers

J Elliott, M Hoemmen, F Mueller - Proceedings of the 24th International …, 2015 - dl.acm.org
We present a fault model designed to bring out the "worst" in iterative solvers based on
mathematical properties. Our model introduces substantially higher overhead, but smaller …

Fault tolerance in an inner-outer solver: a GVR-enabled case study

Z Zheng, AA Chien, K Teranishi - … Conference, Eugene, OR, USA, June 30 …, 2015 - Springer
Resilience is a major challenge for large-scale systems. It is particularly important for
iterative linear solvers, since they take much of the time of many scientific applications. We …

Cooperative application/OS DRAM fault recovery

PG Bridges, M Hoemmen, KB Ferreira… - … Conference on Parallel …, 2011 - Springer
Exascale systems will present considerable fault-tolerance challenges to applications and
system software. These systems are expected to suffer several hard and soft errors per day …

Fault-tolerant iterative methods.

MF Hoemmen, MA Heroux - 2011 - osti.gov
Current iterative methods for solving linear equations assume reliability of data and
arithmetic computations. When the computer system violates this assumption, the algorithm …

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

J Sloan, R Kumar, G Bronevetsky - 2013 43rd Annual IEEE/IFIP …, 2013 - ieeexplore.ieee.org
The increasing size and complexity of massively parallel systems (e.g., HPC systems) is
making it increasingly likely that individual circuits will produce erroneous results. For this …

New-sum: A novel online ABFT scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …

Algorithm based fault tolerance versus result-checking for matrix computations

P Prata, JG Silva - Digest of Papers. Twenty-Ninth Annual …, 1999 - ieeexplore.ieee.org
Algorithm Based Fault Tolerance (ABFT) is the collective name of a set of techniques used to
determine the correctness of some mathematical calculations. A less well known alternative …

EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

S Chakraborty, I Laguna, M Emani… - Concurrency and …, 2020 - Wiley Online Library
Scientists from many different fields have been developing Bulk‐Synchronous MPI
applications to simulate and study a wide variety of scientific phenomena. Since failure rates …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey, G Bosilca… - Computing, 2013 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …