Optimizing Selective Protection for CNN Resilience

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Hardware resilience properties of text-guided image classifiers

ST Wasim, KH Soboka, A Mahmoud… - Advances in …, 2024 - proceedings.neurips.cc
This paper presents a novel method to enhance the reliability of image classification models
during deployment in the face of transient hardware errors. By utilizing enriched text …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

ARC: An automated approach to resiliency for lossy compressed data via error correcting codes

D Fulp, A Poulos, R Underwood… - Proceedings of the 30th …, 2021 - dl.acm.org
Progress in high-performance computing (HPC) systems has led to complex applications
that stress the I/O subsystem by creating vast amounts of data. Lossy compression reduces …

Neuromorphic computing for scientific applications

R Patton, P Date, S Kulkarni… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
Neuromorphic computing technology continues to make strides in the development of new
algorithms, devices, and materials. In addition, applications have begun to emerge where …

Understanding failures through the lifetime of a top-level supercomputer

E Rojas, E Meneses, T Jones, D Maxwell - Journal of Parallel and …, 2021 - Elsevier
High performance computing systems are required to solve grand challenges in many
scientific disciplines. These systems assemble many components to be powerful enough for …

Towards scalable and specialized application error analysis

HN Mahmoud - 2020 - ideals.illinois.edu
Modern systems at scale are increasingly susceptible to transient hardware errors at current
technology sizes from natural phenomena such as high-energy particle strikes (also called …

Resolving Soft Error Susceptibilities Within Lossy Compressed HPC Data

D Fulp - 2021 - search.proquest.com
Due to improvements in high-performance computing (HPC) systems, researchers have
created powerful applications capable of solving previously intractable problems. While …