Adaptive erasure coded fault tolerant linear system solver

X Kang, DF Gleich, A Sameh, A Grama - ACM Transactions on Parallel …, 2021 - dl.acm.org
As parallel and distributed systems scale, fault tolerance is an increasingly important
problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded …

Distributed fault tolerant linear system solvers based on erasure coding

X Kang, DF Gleich, A Sameh… - 2017 IEEE 37th …, 2017 - ieeexplore.ieee.org
We present efficient coding schemes and distributed implementations of erasure coded
linear system solvers. Erasure coded computations belong to the class of algorithmic fault …

Erasure coding for fault-oblivious linear system solvers

Y Zhu, DF Gleich, A Grama - SIAM Journal on Scientific Computing, 2017 - SIAM
Dealing with faults is an important problem as parallel and distributed systems scale to
millions of processing cores. Traditional methods for dealing with faults include checkpoint …

Fault tolerant high performance computing by a coding approach

Z Chen, GE Fagg, E Gabriel, J Langou… - Proceedings of the …, 2005 - dl.acm.org
As the number of processors in today's high performance computers continues to grow, the
mean-time-to-failure of these computers are becoming significantly shorter than the …

Asynchronous and exact forward recovery for detected errors in iterative solvers

L Jaulmes, M Moreto, E Ayguade… - … on Parallel and …, 2018 - ieeexplore.ieee.org
Current trends and projections show that faults in computer systems become increasingly
common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …

Rollback-free recovery for a high performance dense linear solver with reduced memory footprint

D Loreti, M Artioli, A Ciampolini - IEEE Transactions on Parallel …, 2024 - ieeexplore.ieee.org
The scale of nowadays High Performance Computing (HPC) systems is the key element that
determines the achievement of impressive performance, as well as the reason for their …

Scalable and fault tolerant computation with the sparse grid combination technique

B Harding, M Hegland, J Larson, J Southern - arXiv preprint arXiv …, 2014 - arxiv.org
This paper continues to develop a fault tolerant extension of the sparse grid combination
technique recently proposed in [B. Harding and M. Hegland, ANZIAM J., 54 (CTAC2012), pp …

Solving linear systems on high performance hardware with resilience to multiple hard faults

D Loreti, M Artioli, A Ciampolini - … International Symposium on …, 2020 - ieeexplore.ieee.org
As large-scale linear equation systems are pervasive in many scientific fields, great efforts
have been done over the last decade in realizing efficient techniques to solve such systems …

Fault-tolerant linear solvers via selective reliability

PG Bridges, KB Ferreira, MA Heroux… - arXiv preprint arXiv …, 2012 - arxiv.org
Energy increasingly constrains modern computer hardware, yet protecting computations and
data against errors costs energy. This holds at all scales, but especially for the largest …

Multi-fault tolerance for Cartesian data distributions

N Ali, S Krishnamoorthy, M Halappanavar… - International Journal of …, 2013 - Springer
Faults are expected to play an increasingly important role in how algorithms and
applications are designed to run on future extreme-scale systems. Algorithm-based fault …