J Elliott, M Hoemmen, F Mueller - Proceedings of the 24th International …, 2015 - dl.acm.org
We present a fault model designed to bring out the "worst" in iterative solvers based on mathematical properties. Our model introduces substantially higher overhead, but smaller …
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We …
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day …
Current iterative methods for solving linear equations assume reliability of data and arithmetic computations. When the computer system violates this assumption, the algorithm …
J Sloan, R Kumar, G Bronevetsky - 2013 43rd Annual IEEE/IFIP …, 2013 - ieeexplore.ieee.org
The increasing size and complexity of massively parallel systems (e.g., HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this …
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …
P Prata, JG Silva - Digest of Papers. Twenty-Ninth Annual …, 1999 - ieeexplore.ieee.org
Algorithm Based Fault Tolerance (ABFT) is the collective name of a set of techniques used to determine the correctness of some mathematical calculations. A less well known alternative …
Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates …
As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by …