A2: Analog malicious hardware

K Yang, M Hicks, Q Dong, T Austin… - 2016 IEEE symposium …, 2016 - ieeexplore.ieee.org
While the move to smaller transistors has been a boon for performance it has dramatically
increased the cost to fabricate chips using those smaller transistors. This forces the vast …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

The case for lifetime reliability-aware microprocessors

J Srinivasan, SV Adve, P Bose, JA Rivers - ACM SIGARCH Computer …, 2004 - dl.acm.org
Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a
critical requirement for all microprocessor manufacturers. We observe that continuous device …

Shoestring: Probabilistic soft error reliability on the cheap

S Feng, S Gupta, A Ansari, S Mahlke - ACM SIGARCH Computer …, 2010 - dl.acm.org
Aggressive technology scaling provides designers with an ever increasing budget of
cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in …

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults

SKS Hari, SV Adve, H Naeimi… - ACM SIGARCH …, 2012 - dl.acm.org
Future microprocessors need low-cost solutions for reliable operation in the presence of
failure-prone devices. A promising approach is to detect hardware faults by deploying low …

Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory

Y Luo, S Govindan, B Sharma… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

A defect-tolerant accelerator for emerging high-performance applications

O Temam - ACM SIGARCH Computer Architecture News, 2012 - dl.acm.org
Due to the evolution of technology constraints, especially energy constraints which may lead
to heterogeneous multi-cores, and the increasing number of defects, the design of defect …