Silent data corruptions: The stealthy saboteurs of digital integrity

G Papadimitriou, D Gizopoulos… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) pose a significant threat to the integrity of digital systems.
These stealthy saboteurs silently corrupt data, remaining undetected by traditional error …

Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements

D Gizopoulos, G Papadimitriou… - 2024 IEEE European …, 2024 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) due to defects in computing chips (CPUs, GPUs, AI
accelerators) are a critical threat to the quality of large-scale computing in different application …

Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Quantifying the impact of memory errors in deep learning

Z Zhang, L Huang, R Huang, W Xu… - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
The use of deep learning (DL) on HPC resources has become common as scientists explore
and exploit DL methods to solve domain problems. On the other hand, in the coming …

Understanding the propagation of error due to a silent data corruption in a sparse matrix vector multiply

J Calhoun, M Snir, L Olson… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
With the rate of errors that silently affect an application's state/output expected to increase in
future HPC machines, numerous mitigation schemes have been proposed, but little work …

FlipBack: automatic targeted protection against silent data corruption

X Ni, LV Kale - SC'16: Proceedings of the International …, 2016 - ieeexplore.ieee.org
The decreasing size of transistors has been critical to the increase in capacity of
supercomputers. The smaller the transistors are, the less energy is required to flip a bit, and thus …

Which verification for soft error detection?

L Bautista-Gomez, A Benoit, A Cavelan… - 2015 IEEE 22nd …, 2015 - ieeexplore.ieee.org
Many methods are available to detect silent errors in high-performance computing (HPC)
applications. Each comes with a given cost and recall (fraction of all errors that are actually …

PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

Understanding silent data corruptions in a large production CPU population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Detecting silent data corruptions in the wild

HD Dixit, L Boyle, G Vunnam, S Pendharkar… - arXiv preprint arXiv …, 2022 - arxiv.org
Silent Errors within hardware devices occur when an internal defect manifests in a part of the
circuit which does not have check logic to detect the incorrect circuit operation. The results of …