Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that …
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …
SKS Hari, T Tsai, M Stephenson… - … Analysis of Systems …, 2017 - ieeexplore.ieee.org
As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …
I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce higher soft error rates. This trend makes reliability a primary design constraint for computer …
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and high-performance computing systems. As such systems require high levels of resilience to …
This article provides an overview of AMD's vision for exascale computing, and in particular, how heterogeneity will play a central role in realizing this vision. Exascale computing …
Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must …
The challenges to push computing to exaflop levels are difficult given desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper …
Modern Graphics Processing Units (GPUs) demand life expectancy extended to many years, exposing the hardware to aging (ie, permanent faults arising after the end-of-manufacturing …