Silent data corruptions: Microarchitectural perspectives

G Papadimitriou, D Gizopoulos - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Today more than ever before, academia, manufacturers, and hyperscalers acknowledge the
major challenge of silent data corruptions (SDCs) and aim at solutions to minimize its …

Error recovery using forced validity assisted by executable assertions for error detection: An experimental evaluation

M Hiller - … Conference. Informatics: Theory and Practice for the …, 1999 - ieeexplore.ieee.org
This paper proposes and evaluates error detection and recovery mechanisms suitable for
embedded systems. The purpose of these mechanisms is to provide detection of and …

Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications

MH Rahman, S Di, S Guo, X Lu, G Li… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …

Defect Mechanisms Responsible for Silent Data Errors

M Shamsa, D Lerner - 2024 IEEE International Reliability …, 2024 - ieeexplore.ieee.org
As the scale of silicon integration increases, and as System-on-Chip (SoC) devices are
installed in datacenters in ever larger numbers, silicon faults that are undetected by the …

A hybrid concurrent error detection scheme for simultaneous improvement on probability of detection and diagnosability

CH Wang, TY Hsieh - … Test Conference in Asia (ITC-Asia), 2017 - ieeexplore.ieee.org
In this work we propose a hybrid concurrent error detection (CED) scheme that combines the
implication-based method with the parity check method. The parity check method is easy to …

Investigating the impact of high-level software design on low-level hardware fault resilience

B Zhang, L Yang, G Li, H Xu - 2023 53rd Annual IEEE/IFIP …, 2023 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) have become an insurmountable issue that threatens the
system reliability. General strategies for protecting programs from SDCs, such as dual …

Data Center Silent Data Errors: Implications to Artificial Intelligence Workloads & Mitigations

B Bittel, M Shamsa, B Inkley, A Gur… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Silent Data Errors (SDEs) are a unique category of errors that result in unpredictable system
behavior that is often difficult to detect. SDEs can represent a serious concern to at-scale …

Reducing DUE-FIT of caches by exploiting acoustic wave detectors for error recovery

G Upasani, X Vera, A González - 2013 IEEE 19th International …, 2013 - ieeexplore.ieee.org
Cosmic radiation induced soft errors have emerged as a key challenge in computer system
design. The exponential increase in the transistor count will drive the per chip fault rate sky …

Towards end-to-end SDC detection for HPC applications equipped with lossy compression

S Li, S Di, K Zhao, X Liang, Z Chen… - … Conference on Cluster …, 2020 - ieeexplore.ieee.org
Data reduction techniques have been widely demanded and used by large-scale high
performance computing (HPC) applications because of vast volumes of data to be produced …

SmartInjector: Exploiting intelligent fault injection for SDC rate analysis

J Li, Q Tan - 2013 IEEE International Symposium on Defect and …, 2013 - ieeexplore.ieee.org
Recently, researchers have shown that exploiting symptom-based solutions provides a
promising way to achieve low-cost fault tolerance. However, these solutions cannot provide …