An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attentions because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

Toward effective detection of silent data corruptions for HPC applications

S Di, E Berrocal, L Bautista-Gomez… - Proceedings of the …, 2014 - sc14.supercomputing.org
Because of the large number of components, future extreme-scale systems are expected to
suffer a lot of silent data corruptions. Changes caused by silent errors flipping low-order bit …

Silent data corruption—myth or reality?

C Constantinescu, I Parulkar, R Harper… - … and Networks With …, 2008 - ieeexplore.ieee.org
The higher complexity of the hardware and software employed by modern computing
systems, as well as semiconductor technology scaling, are increasing the likelihood of Silent …

SDC is in the eye of the beholder: A survey and preliminary study

B Fang, P Wu, Q Guan, N DeBardeleben… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org
Silent data corruptions (SDCs) are one of the most critical issues in modern HPC systems,
as they are "silent" by definition and raise no warnings to users and application developers …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

Toward general software level silent data corruption detection for parallel applications

E Berrocal, L Bautista-Gomez, S Di… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. Mechanisms have been …

Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …