Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

Toward general software level silent data corruption detection for parallel applications

E Berrocal, L Bautista-Gomez, S Di… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. Mechanisms have been …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

Mitigating silent data corruptions in HPC applications across multiple program inputs

Y Huang, S Guo, S Di, G Li… - … Conference for High …, 2022 - ieeexplore.ieee.org
With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a
common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used …

Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC …

PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …