Asynchronous and exact forward recovery for detected errors in iterative solvers

L Jaulmes, M Moreto, E Ayguade… - … on Parallel and …, 2018 - ieeexplore.ieee.org
Current trends and projections show that faults in computer systems become increasingly
common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Combining backward and forward recovery to cope with silent errors in iterative solvers

M Fasi, Y Robert, B Uçar - 2015 IEEE International Parallel and …, 2015 - ieeexplore.ieee.org
Several recent papers have introduced a periodic verification mechanism to detect silent
errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a …

Combining algorithm-based fault tolerance and checkpointing for iterative solvers

M Fasi, Y Robert, B Uçar - 2015 - inria.hal.science
Several recent papers have introduced a periodic verification mechanism to detect silent
errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a …

A numerical soft fault model for iterative linear solvers

J Elliott, M Hoemmen, F Mueller - Proceedings of the 24th International …, 2015 - dl.acm.org
We present a fault model designed to bring out the "worst" in iterative solvers based on
mathematical properties. Our model introduces substantially higher overhead, but smaller …

Characterization of the impact of soft errors on iterative methods

BO Mutlu, G Kestor, J Manzano, O Unsal… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Soft errors caused by transient bit flips have the potential to significantly impact an
application's behavior. This has motivated the design of an array of techniques to detect …

Improving performance of iterative methods by lossy checkponting

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …

New-Sum: A novel online ABFT scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …

Adaptive erasure coded fault tolerant linear system solver

X Kang, DF Gleich, A Sameh, A Grama - ACM Transactions on Parallel …, 2021 - dl.acm.org
As parallel and distributed systems scale, fault tolerance is an increasingly important
problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded …

Rollback-free recovery for a high performance dense linear solver with reduced memory footprint

D Loreti, M Artioli, A Ciampolini - IEEE Transactions on Parallel …, 2024 - ieeexplore.ieee.org
The scale of nowadays High Performance Computing (HPC) systems is the key element that
determines the achievement of impressive performance, as well as the reason for their …