As CNNs are being extensively employed in high performance and safety-critical applications that demand high reliability, it is important to ensure that they are resilient to …
Confidentiality, authenticity, integrity of data, and runtime security are ubiquitous concerns in modern computer systems. However, these security concerns have traditionally been …
The advent of High-Performance Computing has led to the adoption of Convolutional Neural Networks (CNNs) in safety-critical applications such as autonomous vehicles. However …
D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any …
Today's systems have diverse needs that are difficult to address using one-size-fits-all commodity DRAM. Unfortunately, although system designers can theoretically adapt …
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest …
Generational improvements to commodity DRAM throughout half a century have long solidified its prevalence as main memory across the computing industry. However …
This paper presents a novel method to enhance the reliability of image classification models during deployment in the face of transient hardware errors. By utilizing enriched text …
Z Cheng, S Han, PPC Lee, X Li… - 2022 41st International …, 2022 - ieeexplore.ieee.org
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, little is known about the correlation between DRAM …