Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations

D Ma, F Lin, A Desmaison, J Coburn, D Moore… - Proceedings of the 29th …, 2024 - dl.acm.org
Deep neural networks (DNNs) have been widely adopted in various safety-critical
applications such as computer vision and autonomous driving. However, as technology …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Artificial neural networks for online error detection

V Vassiliadis, K Parasyris, CD Antonopoulos… - arXiv preprint arXiv …, 2021 - arxiv.org
Hardware reliability is adversely affected by the downscaling of semiconductor devices and
the scale-out of systems necessitated by modern applications. Apart from crashes, this …

AI-Lancet: Locating error-inducing neurons to optimize neural networks

Y Zhao, H Zhu, K Chen, S Zhang - Proceedings of the 2021 ACM …, 2021 - dl.acm.org
Deep neural networks (DNNs) have been widely utilized in many areas due to their increasingly
high accuracy. However, DNN models could also produce wrong outputs due to internal …

GATPS: An attention-based graph neural network for predicting SDC-causing instructions

J Ma, Z Duan, L Tang - 2021 IEEE 39th VLSI Test Symposium …, 2021 - ieeexplore.ieee.org
Soft errors can lead to silent data corruption (SDC), seriously compromising the reliability of
a system. To detect SDC, a profiling of SDC-causing instructions is usually needed to decide …

Deep validation: Toward detecting real-world corner cases for deep neural networks

W Wu, H Xu, S Zhong, MR Lyu… - 2019 49th Annual IEEE …, 2019 - ieeexplore.ieee.org
The exceptional performance of deep neural networks (DNNs) encourages their
deployment in safety- and dependability-critical systems. However, DNNs often demonstrate …

Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems

Y He, Y Li - 2023 IEEE European Test Symposium (ETS), 2023 - ieeexplore.ieee.org
Hardware failures pose critical threats to deep neural network (DNN) training workloads,
and the urgency of tackling this challenge (known as the Silent Data Corruption challenge in …

Modeling soft-error propagation in programs

G Li, K Pattabiraman, SKS Hari… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …

SDC Error Detection by Exploring the Importance of Instruction Features

W Fang, J Gu, Z Yan, Q Wang - … , WASA 2021, Nanjing, China, June 25–27 …, 2021 - Springer
The continuous improvement in the integration of semiconductor chips has brought
great challenges to the reliability and safety of the system. Among them, Silent Data …