X Kang, DF Gleich, A Sameh… - 2017 IEEE 37th …, 2017 - ieeexplore.ieee.org
We present efficient coding schemes and distributed implementations of erasure coded linear system solvers. Erasure coded computations belong to the class of algorithmic fault …
Y Zhu, DF Gleich, A Grama - SIAM Journal on Scientific Computing, 2017 - SIAM
Dealing with faults is an important problem as parallel and distributed systems scale to millions of processing cores. Traditional methods for dealing with faults include checkpoint …
As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the …
Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …
D Loreti, M Artioli, A Ciampolini - IEEE Transactions on Parallel …, 2024 - ieeexplore.ieee.org
The scale of today's High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their …
This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J., 54 (CTAC2012), pp …
D Loreti, M Artioli, A Ciampolini - … International Symposium on …, 2020 - ieeexplore.ieee.org
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have been made over the last decade in realizing efficient techniques to solve such systems …
Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest …
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault …