Cooperative application/OS DRAM fault recovery

PG Bridges, M Hoemmen, KB Ferreira… - … Conference on Parallel …, 2011 - Springer
Exascale systems will present considerable fault-tolerance challenges to applications and
system software. These systems are expected to suffer several hard and soft errors per day …

Fault-tolerant linear solvers via selective reliability

PG Bridges, KB Ferreira, MA Heroux… - arXiv preprint arXiv …, 2012 - arxiv.org
Energy increasingly constrains modern computer hardware, yet protecting computations and
data against errors costs energy. This holds at all scales, but especially for the largest …

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

A survey of fault-tolerance and fault-recovery techniques in parallel systems

M Treaster - arXiv preprint cs/0501002, 2005 - arxiv.org
Supercomputing systems today often come in the form of large numbers of commodity
systems linked together into a computing cluster. These systems, like any distributed system …

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Egida: An extensible toolkit for low-overhead fault-tolerance

S Rao, L Alvisi, HM Vin - Digest of Papers. Twenty-Ninth Annual …, 1999 - ieeexplore.ieee.org
We discuss the design and implementation of Egida, an object-oriented toolkit designed to
support transparent rollback-recovery. Egida exports a simple specification language that …

Fault tolerance in an inner-outer solver: a GVR-enabled case study

Z Zheng, AA Chien, K Teranishi - … Conference, Eugene, OR, USA, June 30 …, 2015 - Springer
Resilience is a major challenge for large-scale systems. It is particularly important for
iterative linear solvers, since they take much of the time of many scientific applications. We …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

FTPA: Supporting fault-tolerant parallel computing through parallel recomputing

X Yang, Y Du, P Wang, H Fu… - IEEE Transactions on …, 2008 - ieeexplore.ieee.org
As the size of large-scale computer systems increases, their mean-time-between-failures are
becoming significantly shorter than the execution time of many current scientific applications …

New-sum: A novel online ABFT scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …