Characterizing and mitigating soft errors in GPU DRAM

MB Sullivan, N Saxena, M O'Connor, D Lee… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
GPUs are used in high-reliability systems, including high-performance computers and
autonomous vehicles. Because GPUs employ a high-bandwidth, wide-interface to DRAM …

Optimizing Selective Protection for CNN Resilience

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Voodoo: Memory Tagging, Authenticated Encryption, and Error Correction through MAGIC

L Lamster, M Unterguggenberger… - 33rd USENIX Security …, 2024 - usenix.org
Confidentiality, authenticity, integrity of data, and runtime security are ubiquitous concerns in
modern computer systems. However, these security concerns have traditionally been …

Structural coding: A low-cost scheme to protect CNNs from large-granularity memory faults

A Asgari Khoshouyeh, F Geissler, S Qutub… - Proceedings of the …, 2023 - dl.acm.org
The advent of High-Performance Computing has led to the adoption of Convolutional Neural
Networks (CNNs) in safety-critical applications such as autonomous vehicles. However …

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

A case for transparent reliability in DRAM systems

M Patel, T Shahroodi, A Manglik, AG Yağlıkçı… - arXiv preprint arXiv …, 2022 - arxiv.org
Today's systems have diverse needs that are difficult to address using one-size-fits-all
commodity DRAM. Unfortunately, although system designers can theoretically adapt …

Cost-aware prediction of uncorrected DRAM errors in the field

I Boixaderas, D Zivanovic, S Moré… - … Conference for High …, 2020 - ieeexplore.ieee.org
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading
cause of hardware failures in large-scale HPC clusters. The method uses a random forest …

Rethinking the Producer-Consumer Relationship in Modern DRAM-Based Systems

M Patel, T Shahroodi, A Manglik, AG Yağlıkçı… - IEEE …, 2024 - ieeexplore.ieee.org
Generational improvements to commodity DRAM throughout half a century have long
solidified its prevalence as main memory across the computing industry. However …

Hardware resilience properties of text-guided image classifiers

ST Wasim, KH Soboka, A Mahmoud… - Advances in …, 2024 - proceedings.neurips.cc
This paper presents a novel method to enhance the reliability of image classification models
during deployment in the face of transient hardware errors. By utilizing enriched text …

An in-depth correlative study between DRAM errors and server failures in production data centers

Z Cheng, S Han, PPC Lee, X Li… - 2022 41st International …, 2022 - ieeexplore.ieee.org
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures
in production data centers. However, little is known about the correlation between DRAM …