LADR: Low-cost application-level detector for reducing silent output corruptions

C Chen, G Eisenhauer, M Wolf, S Pande - Proceedings of the 27th …, 2018 - dl.acm.org
Applications running on future high performance computing (HPC) systems are more likely
to experience transient faults due to technology scaling trends with respect to higher circuit …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …

FlipBack: automatic targeted protection against silent data corruption

X Ni, LV Kale - SC'16: Proceedings of the International …, 2016 - ieeexplore.ieee.org
The decreasing size of transistors has been critical to the increase in capacity of
supercomputers. The smaller the transistors are, the less energy is required to flip a bit, and thus …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

Detecting silent data corruptions in the wild

HD Dixit, L Boyle, G Vunnam, S Pendharkar… - arXiv preprint arXiv …, 2022 - arxiv.org
Silent Errors within hardware devices occur when an internal defect manifests in a part of the
circuit which does not have check logic to detect the incorrect circuit operation. The results of …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Peppa-x: finding program test inputs to bound silent data corruption vulnerability in hpc applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

DisCVar: discovering critical variables using algorithmic differentiation for transient faults

H Menon, K Mohror - ACM SIGPLAN Notices, 2018 - dl.acm.org
Aggressive technology scaling trends have made the hardware of high performance
computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent …