Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest …
Future extreme-scale computer systems may expose incorrect behavior to applications, in order to save energy or increase performance. However, resilience research struggles to …
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) relying on error detection techniques already available in commodity …
J Elliott, M Hoemmen, F Mueller - Proceedings of the 24th International …, 2015 - dl.acm.org
We present a fault model designed to bring out the "worst" in iterative solvers based on mathematical properties. Our model introduces substantially higher overhead, but smaller …
Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …
The increasing scale and complexity of today's high-performance computing (HPC) systems demand a renewed focus on enhancing the resilience of long-running scientific applications …
Z Rubenstein, H Fujita, Z Zheng… - Technical Report TR …, 2013 - newtraell.cs.uchicago.edu
Soft errors are a significant concern for high-performance computing systems in the exascale time frame. We apply our group's Global View Resilience (GVR) library to a preconditioned …
E Agullo, L Giraud, A Guermouche, J Roman… - 2013 - inria.hal.science
The advent of extreme scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. High Performance …
On future large-scale systems, the mean time between failures (MTBF) of the system is expected to decrease so that many faults could occur during the solution of large problems …