Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Understanding the propagation of hard errors to software and implications for resilient system design

ML Li, P Ramachandran, SK Sahoo, SV Adve… - ACM SIGPLAN …, 2008 - dl.acm.org
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to
in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low …

Automatic instruction-level software-only recovery

GA Reis, J Chang, DI August - IEEE Micro, 2007 - ieeexplore.ieee.org
Software-only reliability techniques protect against transient faults without the overhead of
hardware techniques. Although existing low-level software-only fault-tolerance techniques …

Automatic instruction-level software-only recovery

J Chang, GA Reis, DI August - International Conference on …, 2006 - ieeexplore.ieee.org
As chip densities and clock rates increase, processors are becoming more susceptible to
transient faults that can affect program correctness. Computer architects have typically …

eWASM: Practical Software Fault Isolation for Reliable Embedded Devices

G Peach, R Pan, Z Wu, G Parmer… - … on Computer-Aided …, 2020 - ieeexplore.ieee.org
As we connect more microcontrollers to the Internet and employ them to control the physical
world around us, their reliability and security are increasingly important. Many …

Encore: low-cost, fine-grained transient fault recovery

S Feng, S Gupta, A Ansari, SA Mahlke… - Proceedings of the 44th …, 2011 - dl.acm.org
To meet an insatiable consumer demand for greater performance at less power, silicon
technology has scaled to unprecedented dimensions. However, the pursuit of faster …

Software-controlled fault tolerance

GA Reis, J Chang, N Vachharajani, R Rangan… - ACM Transactions on …, 2005 - dl.acm.org
Traditional fault-tolerance techniques typically utilize resources ineffectively because they
cannot adapt to the changing reliability and performance demands of a system. This paper …

ASSURE: automatic software self-healing using rescue points

S Sidiroglou, O Laadan, C Perez, N Viennot… - ACM SIGARCH …, 2009 - dl.acm.org
Software failures in server applications are a significant problem for preserving system
availability. We present ASSURE, a system that introduces rescue points that recover …

ReStore: Symptom-based soft error detection in microprocessors

NJ Wang, SJ Patel - IEEE Transactions on Dependable and …, 2006 - ieeexplore.ieee.org
Device scaling and large-scale integration have led to growing concerns about soft errors in
microprocessors. To date, in all but the most demanding applications, implementing parity …

Fine-grained fault tolerance using device checkpoints

A Kadav, MJ Renzelmann, MM Swift - ACM SIGPLAN Notices, 2013 - dl.acm.org
Recovering faults in drivers is difficult compared to other code because their state is spread
across both memory and a device. Existing driver fault-tolerance mechanisms either restart …