Identify silent data corruption vulnerable instructions using SVM

N Yang, Y Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Silent data corruption (SDC) is the most insidious and harmful result type of soft error.
Identifying program vulnerable instructions (PVIns) that are likely to cause SDCs is extremely …

A tale of two injectors: End-to-end comparison of IR-level and assembly-level fault injection

L Palazzi, G Li, B Fang… - 2019 IEEE 30th …, 2019 - ieeexplore.ieee.org
Fault injection (FI) is a commonly used experimental technique to evaluate the resilience of
software techniques for tolerating hardware faults. Software-implemented FI can be …

PEPPA-X: Finding program test inputs to bound silent data corruption vulnerability in HPC applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

Towards end-to-end SDC detection for HPC applications equipped with lossy compression

S Li, S Di, K Zhao, X Liang, Z Chen… - … Conference on Cluster …, 2020 - ieeexplore.ieee.org
Data reduction techniques have been widely demanded and used by large-scale high-performance
computing (HPC) applications because of vast volumes of data to be produced …

ZOFI: Zero-overhead fault injection tool for fast transient fault coverage analysis

V Porpodas - arXiv preprint arXiv:1906.09390, 2019 - arxiv.org
The experimental evaluation of fault-tolerance studies relies on tools that inject errors while
programs are running, and then monitor the execution and the output for faulty execution. In …

Resilient scheduling of moldable jobs on failure-prone platforms

A Benoit, V Le Fèvre, L Perotin… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance
computing (HPC) platforms. Moldable jobs allow for choosing a processor …

Design and comparison of resilient scheduling heuristics for parallel jobs

A Benoit, V Le Fèvre, P Raghavan… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
This paper focuses on the resilient scheduling of parallel jobs on high-performance
computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit …

Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications

MH Rahman, S Di, S Guo, X Lu, G Li… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …

CARE: Compiler-assisted recovery from soft failures

C Chen, G Eisenhauer, S Pande, Q Guan - Proceedings of the …, 2019 - dl.acm.org
As processors continue to boost the system performance with higher circuit density,
shrinking process technology and near-threshold voltage (NTV) operations, they are …

Resilient scheduling of moldable parallel jobs to cope with silent errors

A Benoit, V Le Fèvre, L Perotin… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
We study the resilient scheduling of moldable parallel jobs on high-performance computing
(HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution …