Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

G Zhang, Y Liu, H Yang, D Qian - The Journal of Supercomputing, 2022 - Springer
Nowadays, high-performance computing (HPC) is stepping forward to exascale era.
However, silent data corruption (SDC) behaved as bit-flipping can cause disastrous …

Toward general software level silent data corruption detection for parallel applications

E Berrocal, L Bautista-Gomez, S Di… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. Mechanisms have been …

Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attentions because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

[PDF][PDF] Toward effective detection of silent data corruptions for hpc applications

S Di, E Berrocal, L Bautista-Gomez… - Proceedings of the …, 2014 - sc14.supercomputing.org
Because of the large number of components, future extreme-scale systems are expected to
suffer a lot of silent data corruptions. Changes caused by silent errors flipping low-order bit …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Bi-source verification against silent data corruption in high performance computing

EA Krluku, M Gusev, V Zdraveski - … of the 9th Balkan Conference on …, 2019 - dl.acm.org
This paper proposes a continuous health-check approach for detecting Silent Data
Corruption (SCD) in High Performance Computing (HPC) environments. The goal is to …