Resource-intensive applications such as scientific applications require the architecture or system on which they execute to display a very high level of dependability to reduce the …
Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to …
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and …
State-of-the-art approaches formulate error detection as a semi-supervised classification problem. Recent research suggests that active learning is insufficiently effective for error …
Single-Event Effects are an increasingly important issue in electronic circuits due to technology scaling; efficient error detection schemes are thus required for circuits dedicated …
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers …
Z Li, H Menon, D Maljovec, Y Livnat… - … on Visualization and …, 2020 - ieeexplore.ieee.org
The trend of rapid technology scaling is expected to make the hardware of high-performance computing (HPC) systems more susceptible to computational errors due to random bit flips …
Q Lu, G Li, K Pattabiraman, MS Gupta… - ACM Transactions on …, 2017 - dl.acm.org
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded systems. However, current protection techniques are brittle and do not allow …
Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent data corruption (SDC) is the most dangerous and insidious type of soft error result. To detect …