Towards transient fault tolerance for heterogeneous computing platforms

N George, J Lach, S Gurumurthi - Proc. Workshop on Compiler and …, 2008 - academia.edu
The computing demands of applications coupled with the power wall problem in modern
processors are expected to pave the way for heterogeneous computing platforms that are …

Model-implemented fault injection for robustness assessment

R Svenningsson - 2011 - diva-portal.org
The complexity of safety-related embedded computer systems is steadily increasing.
Besides verifying that such systems implement the correct functionality, it is essential to …

Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults

SKS Hari, SV Adve, H Naeimi… - ACM SIGARCH …, 2012 - dl.acm.org
Future microprocessors need low-cost solutions for reliable operation in the presence of
failure-prone devices. A promising approach is to detect hardware faults by deploying low …

Fault and timing analysis in critical multi-core systems: A survey with an avionics perspective

A Löfwenmark, S Nadjm-Tehrani - Journal of Systems Architecture, 2018 - Elsevier
With more functionality added to future safety-critical avionics systems, new platforms are
required to offer the computational capacity needed. Multi-core processors offer a potential …

Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction

CF Chandler, C Leangsuksun… - Proceedings of the 2009 …, 2009 - dl.acm.org
One predominant barrier encountered in furthering research and development efforts aimed
at facilitating resilient HPC applications is a substantial lack of existing reliability and …

Adapting to intermittent faults in multicore systems

PM Wells, K Chakraborty, GS Sohi - ACM SIGOPS Operating Systems …, 2008 - dl.acm.org
Future multicore processors will be more susceptible to a variety of hardware failures. In
particular, intermittent faults, caused in part by manufacturing, thermal, and voltage …

Measuring the resiliency of extreme-scale computing environments

C Di Martino, Z Kalbarczyk, R Iyer - … and evaluation: Essays in honor of …, 2016 - Springer
This chapter presents a case study on how to characterize the resiliency of large-scale
computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid …

A small-scale testbed for large-scale reliable computing

JRS John - 2016 - search.proquest.com
High performance computing (HPC) systems frequently suffer errors and failures from
hardware components that negatively impact the performance of jobs run on these systems …

Sim-SODA: A unified framework for architectural level software reliability analysis

X Fu, T Li, J Fortes - Workshop on modeling, benchmarking and …, 2006 - ecoms.ee.uh.edu
Semiconductor transient faults (soft errors) are becoming an increasingly critical threat to
reliable software execution. With the advent of the billion transistor chip era, it is impractical …

Adapting to Intermittent Faults in Multicore Systems

K Chakraborty, PM Wells, GS Sohi - Proceedings of the 13th …, 2007 - academia.edu
Future multicore processors will be more susceptible to a variety of hardware failures. In
particular, intermittent faults, caused in part by manufacturing, thermal, and voltage …