Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

Toward general software level silent data corruption detection for parallel applications

E Berrocal, L Bautista-Gomez, S Di… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. Mechanisms have been …

Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
Silent data corruption (SDC) poses a great challenge for high-performance computing
(HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC …

LADR: Low-cost application-level detector for reducing silent output corruptions

C Chen, G Eisenhauer, M Wolf, S Pande - Proceedings of the 27th …, 2018 - dl.acm.org
Applications running on future high performance computing (HPC) systems are more likely
to experience transient faults due to technology scaling trends with respect to higher circuit …

Neural network based silent error detector

C Wang, N Dryden, F Cappello… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
As we move toward exascale platforms, silent data corruptions (SDC) are likely to occur
more frequently. Such errors can lead to incorrect results. Attempts have been made to use …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …