Evaluating and accelerating high-fidelity error injection for HPC

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org

As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

Cited by 123 - Related articles - All 8 versions

[PDF] arxiv.org

A low-cost fault corrector for deep neural networks through range restriction

Z Chen, G Li, K Pattabiraman - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org

The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …

Cited by 108 - Related articles - All 7 versions

An empirical study of the impact of single and multiple bit-flip errors in programs

B Sangchoolie, K Pattabiraman… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

Recent studies have shown that technology and voltage scaling are expected to increase
the likelihood that particle-induced soft errors manifest as multiple-bit errors. This raises …

Cited by 33 - Related articles - All 3 versions

[PDF] github.io

Mitigating silent data corruptions in HPC applications across multiple program inputs

Y Huang, S Guo, S Di, G Li… - … Conference for High …, 2022 - ieeexplore.ieee.org

With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a
common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used …

Cited by 12 - Related articles - All 7 versions

[PDF] nvidia.com

GPU-Trident: efficient modeling of error propagation in GPU programs

AR Anwer, G Li, K Pattabiraman… - … Conference for High …, 2020 - ieeexplore.ieee.org

Fault injection (FI) techniques are typically used to determine the reliability profiles of
programs under soft errors. However, these techniques are highly resource- and time …

Cited by 27 - Related articles - All 6 versions

Resilience assessment of large language models under transient hardware faults

UK Agarwal, A Chan… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org

Large Language Models (LLMs) are transforming the field of natural language processing
and revolutionizing the way machines interact with humans. LLMs like ChatGPT and …

Cited by 8 - Related articles - All 2 versions

[PDF] mit.edu

Characterizing and Improving Resilience of Accelerators to Memory Errors in Autonomous Robots

D Shah, ZY Xue, K Pattabiraman… - ACM Transactions on …, 2024 - dl.acm.org

Motion planning is a computationally intensive and well-studied problem in autonomous
robots. However, motion planning hardware accelerators (MPA) must be soft-error resilient …

Cited by 1 - Related articles - All 4 versions

[PDF] arxiv.org

Towards a safety case for hardware fault tolerance in convolutional neural networks using activation range supervision

F Geissler, S Qutub, S Roychowdhury, A Asgari… - arXiv preprint arXiv …, 2021 - arxiv.org

Convolutional neural networks (CNNs) have become an established part of numerous
safety-critical computer vision applications, including human robot interactions and automated …

Cited by 18 - Related articles - All 3 versions

Druto: Upper-bounding silent data corruption vulnerability in GPU applications

MH Rahman, S Di, S Guo, X Lu, G Li… - 2024 IEEE …, 2024 - ieeexplore.ieee.org

Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …

Cited by 1 - Related articles - All 5 versions

[PDF] thecvf.com

FT-DeepNets: Fault-Tolerant Convolutional Neural Networks with Kernel-based Duplication

I Baek, W Chen, Z Zhu, S Samii… - Proceedings of the …, 2022 - openaccess.thecvf.com

Deep neural network (deepnet) applications play a crucial role in safety-critical systems
such as autonomous vehicles (AVs). An AV must drive safely towards its destination …

Cited by 9 - Related articles - All 5 versions
