Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC …

Toward general software level silent data corruption detection for parallel applications

E Berrocal, L Bautista-Gomez, S Di… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. Mechanisms have been …

Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

G Zhang, Y Liu, H Yang, D Qian - The Journal of Supercomputing, 2022 - Springer
Nowadays, high-performance computing (HPC) is stepping forward to exascale era.
However, silent data corruption (SDC) behaved as bit-flipping can cause disastrous …

Detecting silent data corruption for extreme-scale MPI applications

L Bautista-Gomez, F Cappello - Proceedings of the 22nd European MPI …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. These trends are pushing …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attentions because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

Understanding silent data corruptions in a large production cpu population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …