Predicting the silent data error prone devices using machine learning

ME Shaik, AK Mishra, Y Kim - 2023 IEEE 41st VLSI Test …, 2023 - ieeexplore.ieee.org
Silent Data Errors (SDEs) are a subset of Defective Parts per Million (DPPM) test escapes
that cause unnoticed data corruption. Even at very low levels of DPPM, these are visible at …

Towards increasing the error handling time window in large-scale distributed systems using console and resource usage logs

N Gurumdimma, A Jhumka, M Liakata… - 2015 IEEE Trustcom …, 2015 - ieeexplore.ieee.org
Resource-intensive applications such as scientific applications require the architecture or
system on which they execute to display a very high level of dependability to reduce the …

Understanding scale-dependent soft-error behavior of scientific applications

G Kestor, IB Peng, R Gioiosa… - 2018 18th IEEE/ACM …, 2018 - ieeexplore.ieee.org
Analyzing application fault behavior on large-scale systems is time-consuming and
resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to …

Automatic risk-based selective redundancy for fault-tolerant task-parallel HPC applications

O Subasi, O Unsal, S Krishnamoorthy - Proceedings of the Third …, 2017 - dl.acm.org
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in
high-performance computing (HPC) systems. In this study, we present an automatic, efficient and …

ED2: A case for active learning in error detection

F Neutatz, M Mahdavi, Z Abedjan - Proceedings of the 28th ACM …, 2019 - dl.acm.org
State-of-the-art approaches formulate error detection as a semi-supervised classification
problem. Recent research suggests that active learning is insufficiently effective for error …

Comparative analysis of redundancy schemes for soft-error detection in low-cost space applications

C Frenkel, JD Legat, D Bol - 2016 IFIP/IEEE International …, 2016 - ieeexplore.ieee.org
Single-Event Effects are an increasingly important issue in electronic circuits due to
technology scaling; efficient error detection schemes are thus required for circuits dedicated …

Optimal resilience patterns to cope with fail-stop and silent errors

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop
errors. Many others deal with silent errors (or silent data corruptions). But very few papers …

SpotSDC: Revealing the silent data corruption propagation in high-performance computing systems

Z Li, H Menon, D Maljovec, Y Livnat… - … on Visualization and …, 2020 - ieeexplore.ieee.org
The trend of rapid technology scaling is expected to make the hardware of high-performance
computing (HPC) systems more susceptible to computational errors due to random bit flips …

Configurable detection of SDC-causing errors in programs

Q Lu, G Li, K Pattabiraman, MS Gupta… - ACM Transactions on …, 2017 - dl.acm.org
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including
embedded systems. However, current protection techniques are brittle and do not allow …

F_Radish: Enhancing Silent Data Corruption Detection for Aerospace-Based Computing

N Yang, Y Wang - Electronics, 2020 - mdpi.com
Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent
data corruption (SDC) is the most dangerous and insidious type of soft error result. To detect …