Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft …
Silent data corruptions (SDCs) are one of the most critical issues in modern HPC systems, as they are "silent" by definition and raise no warnings to users and application developers …
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant …
S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …
Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory and storage) may significantly increase the risks on …
TE Thomas, AJ Bhattad, S Mitra… - 2016 IEEE 35th …, 2016 - ieeexplore.ieee.org
The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating …
With the rate of errors that silently affect an application's state/output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work …
J Liu, G Agrawal - 2016 IEEE 23rd International Conference on …, 2016 - ieeexplore.ieee.org
Silent data corruption (SDC) from soft errors is one of the challenges for Exascale systems as the number of cores is increasing and the feature size is decreasing. In recent years, a …
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually …