BinFI: an efficient fault injector for safety-critical machine learning systems

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

A low-cost fault corrector for deep neural networks through range restriction

Z Chen, G Li, K Pattabiraman - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …

An empirical study of the impact of single and multiple bit-flip errors in programs

B Sangchoolie, K Pattabiraman… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Recent studies have shown that technology and voltage scaling are expected to increase
the likelihood that particle-induced soft errors manifest as multiple-bit errors. This raises …

Mitigating silent data corruptions in HPC applications across multiple program inputs

Y Huang, S Guo, S Di, G Li… - … Conference for High …, 2022 - ieeexplore.ieee.org
With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a
common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used …

GPU-Trident: efficient modeling of error propagation in GPU programs

AR Anwer, G Li, K Pattabiraman… - … Conference for High …, 2020 - ieeexplore.ieee.org
Fault injection (FI) techniques are typically used to determine the reliability profiles of
programs under soft errors. However, these techniques are highly resource- and time …

Resilience assessment of large language models under transient hardware faults

UK Agarwal, A Chan… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Large Language Models (LLMs) are transforming the field of natural language processing
and revolutionizing the way machines interact with humans. LLMs like ChatGPT and …

Characterizing and Improving Resilience of Accelerators to Memory Errors in Autonomous Robots

D Shah, ZY Xue, K Pattabiraman… - ACM Transactions on …, 2024 - dl.acm.org
Motion planning is a computationally intensive and well-studied problem in autonomous
robots. However, motion planning hardware accelerators (MPA) must be soft-error resilient …

Towards a safety case for hardware fault tolerance in convolutional neural networks using activation range supervision

F Geissler, S Qutub, S Roychowdhury, A Asgari… - arXiv preprint arXiv …, 2021 - arxiv.org
Convolutional neural networks (CNNs) have become an established part of numerous safety-
critical computer vision applications, including human robot interactions and automated …

Druto: Upper-bounding silent data corruption vulnerability in GPU applications

MH Rahman, S Di, S Guo, X Lu, G Li… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …

FT-DeepNets: Fault-Tolerant Convolutional Neural Networks with Kernel-based Duplication

I Baek, W Chen, Z Zhu, S Samii… - Proceedings of the …, 2022 - openaccess.thecvf.com
Deep neural network (deepnet) applications play a crucial role in safety-critical systems
such as autonomous vehicles (AVs). An AV must drive safely towards its destination …