Exploiting spatial smoothness in HPC applications to detect silent data corruption

L Bautista-Gomez, F Cappello - 2015 IEEE 17th International …, 2015 - ieeexplore.ieee.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. This situation is pushing …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

SDC is in the eye of the beholder: A survey and preliminary study

B Fang, P Wu, Q Guan, N DeBardeleben… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org
Silent data corruptions (SDCs) are one of the most critical issues in modern HPC systems,
as they are "silent" by definition and raise no warnings to users and application developers …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Sirius: Neural network based probabilistic assertions for detecting silent data corruption in parallel programs

TE Thomas, AJ Bhattad, S Mitra… - 2016 IEEE 35th …, 2016 - ieeexplore.ieee.org
The size and complexity of supercomputing clusters are rapidly increasing to cater to the
needs of complex scientific applications. At the same time, the feature size and operating …

Understanding the propagation of error due to a silent data corruption in a sparse matrix vector multiply

J Calhoun, M Snir, L Olson… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
With the rate of errors that silently affect an application's state/output expected to increase in
future HPC machines, numerous mitigation schemes have been proposed, but little work …

Soft error detection for iterative applications using offline training

J Liu, G Agrawal - 2016 IEEE 23rd International Conference on …, 2016 - ieeexplore.ieee.org
Silent data corruption (SDC) from soft errors is one of the challenges for Exascale systems
as the number of cores is increasing and the feature size is decreasing. In recent years, a …

Which verification for soft error detection?

L Bautista-Gomez, A Benoit, A Cavelan… - 2015 IEEE 22nd …, 2015 - ieeexplore.ieee.org
Many methods are available to detect silent errors in high-performance computing (HPC)
applications. Each comes with a given cost and recall (fraction of all errors that are actually …