An automated framework for selectively tolerating SDC errors based on rigorous instruction-level vulnerability assessment

HA Ahmad, Y Sedaghat - Future Generation Computer Systems, 2024 - Elsevier
The recent trend in most processor manufacturing technologies has significantly increased
the vulnerability of embedded systems operating in harsh environments against soft errors …

Machine learning-based run-time anomaly detection in software systems: An industrial evaluation

F Huch, M Golagha, A Petrovska… - 2018 IEEE Workshop on …, 2018 - ieeexplore.ieee.org
Anomalies are an inevitable occurrence while operating enterprise software systems.
Traditionally, anomalies are detected by threshold-based alarms for critical metrics, or health …

Detecting and correcting data corruption in stencil applications through multivariate interpolation

L Bautista-Gomez, F Cappello - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
High-performance computing is a powerful tool that allows scientists to study complex
natural phenomena. Extreme-scale supercomputers promise orders of magnitude higher …

PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

Error recovery using forced validity assisted by executable assertions for error detection: An experimental evaluation

M Hiller - … Conference. Informatics: Theory and Practice for the …, 1999 - ieeexplore.ieee.org
This paper proposes and evaluates error detection and recovery mechanisms suitable for
embedded systems. The purpose of these mechanisms is to provide detection of and …

SmartInjector: Exploiting intelligent fault injection for SDC rate analysis

J Li, Q Tan - 2013 IEEE International Symposium on Defect and …, 2013 - ieeexplore.ieee.org
Recently, researchers have shown that exploiting symptom-based solutions provides a
promising way to achieve low-cost fault tolerance. However, these solutions cannot provide …

Δ-encoding: Practical encoded processing

D Kuvaiskii, C Fetzer - 2015 45th Annual IEEE/IFIP …, 2015 - ieeexplore.ieee.org
Transient and permanent errors in memory and CPUs occur with alarming frequency.
Although most of these errors are masked at the hardware level or result in crashes, a non …

Demystifying soft error assessment strategies on ARM CPUs: Microarchitectural fault injection vs. neutron beam experiments

A Chatzidimitriou, P Bodmann… - 2019 49th Annual …, 2019 - ieeexplore.ieee.org
Fault injection in early microarchitecture-level simulation CPU models and beam
experiments on the final physical CPU chip are two established methodologies to assess the …

Quantifying the impact of memory errors in deep learning

Z Zhang, L Huang, R Huang, W Xu… - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
The use of deep learning (DL) on HPC resources has become common as scientists explore
and exploit DL methods to solve domain problems. On the other hand, in the coming …

Detecting Silent Data Corruptions in Aerospace‐Based Computing Using Program Invariants

J Ma, D Yu, Y Wang, Z Cai, Q Zhang… - International Journal of …, 2016 - Wiley Online Library
Soft errors caused by single event upsets have been a severe challenge to aerospace‐based
computing. Silent data corruption (SDC) is one of the consequences of soft errors. SDC …