Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Reliable on-chip systems in the nano-era: Lessons learnt and future trends

J Henkel, L Bauer, N Dutt, P Gupta, S Nassif… - Proceedings of the 50th …, 2013 - dl.acm.org
Reliability concerns due to technology scaling have been a major focus of researchers and
designers for several technology nodes. Therefore, many new techniques for enhancing and …

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults

SKS Hari, SV Adve, H Naeimi… - ACM SIGARCH …, 2012 - dl.acm.org
Future microprocessors need low-cost solutions for reliable operation in the presence of
failure-prone devices. A promising approach is to detect hardware faults by deploying low …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

Underdesigned and opportunistic computing in presence of hardware variability

P Gupta, Y Agarwal, L Dolecek, N Dutt… - … on Computer-Aided …, 2012 - ieeexplore.ieee.org
Microelectronic circuits exhibit increasing variations in performance, power consumption,
and reliability parameters across the manufactured parts and across use of these parts over …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …

Optimizing Selective Protection for CNN Resilience.

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Architectures for online error detection and recovery in multicore processors

D Gizopoulos, M Psarakis, SV Adve… - … , Automation & Test …, 2011 - ieeexplore.ieee.org
The huge investment in the design and production of multicore processors may be put at risk
because the emerging highly miniaturized but unreliable fabrication technologies will …

Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency

R Venkatagiri, A Mahmoud, SKS Hari… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
Approximate computing environments trade off computational accuracy for improvements in
performance, energy, and resiliency cost. For widespread adoption of approximate …