Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors

X Dong, SI Yu, X Weng, SE Wei… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we present supervision-by-registration, an unsupervised approach to improve
the precision of facial landmark detectors on both images and video. Our key observation is …

Understanding silent data corruptions in a large production CPU population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …

Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs

J Chen, X Liang, Z Chen - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Extensive research has been done on developing and optimizing algorithm-based fault
tolerance (ABFT) schemes for systolic arrays and general purpose microprocessors …

Towards a more complete understanding of SDC propagation

J Calhoun, M Snir, LN Olson, WD Gropp - Proceedings of the 26th …, 2017 - dl.acm.org
With the rate of errors that can silently affect an application's state/output expected to
increase on future HPC machines, numerous application-level detection and recovery …

Detecting and correcting data corruption in stencil applications through multivariate interpolation

L Bautista-Gomez, F Cappello - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
High-performance computing is a powerful tool that allows scientists to study complex
natural phenomena. Extreme-scale supercomputers promise orders of magnitude higher …

Neural network based silent error detector

C Wang, N Dryden, F Cappello… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
As we move toward exascale platforms, silent data corruptions (SDC) are likely to occur
more frequently. Such errors can lead to incorrect results. Attempts have been made to use …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …