Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

Exploiting spatial smoothness in HPC applications to detect silent data corruption

L Bautista-Gomez, F Cappello - 2015 IEEE 17th International …, 2015 - ieeexplore.ieee.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. This situation is pushing …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …

Detecting silent data corruption through data dynamic monitoring for scientific applications

L Bautista-Gomez, F Cappello - ACM SIGPLAN Notices, 2014 - dl.acm.org
Parallel programming has become one of the best ways to express scientific models that
simulate a wide range of natural phenomena. These complex parallel codes are deployed …

LADR: Low-cost application-level detector for reducing silent output corruptions

C Chen, G Eisenhauer, M Wolf, S Pande - Proceedings of the 27th …, 2018 - dl.acm.org
Applications running on future high performance computing (HPC) systems are more likely
to experience transient faults due to technology scaling trends with respect to higher circuit …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

Neural network based silent error detector

C Wang, N Dryden, F Cappello… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
As we move toward exascale platforms, silent data corruptions (SDC) are likely to occur
more frequently. Such errors can lead to incorrect results. Attempts have been made to use …

Online diagnosis of performance variation in HPC systems using machine learning

O Tuncer, E Ates, Y Zhang, A Turk… - … on Parallel and …, 2018 - ieeexplore.ieee.org
As the size and complexity of high performance computing (HPC) systems grow in line with
advancements in hardware and software technology, HPC systems increasingly suffer from …