The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors

X Dong, SI Yu, X Weng, SE Wei… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we present supervision-by-registration, an unsupervised approach to improve
the precision of facial landmark detectors on both images and video. Our key observation is …

Anomaly detection and anticipation in high performance computing systems

A Borghesi, M Molan, M Milano… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly
becoming larger and more complex, together with the issues concerning their maintenance …

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Improving performance of iterative methods by lossy checkpointing

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …

New-Sum: A novel online ABFT scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …

Correcting soft errors online in fast Fourier transform

X Liang, J Chen, D Tao, S Li, P Wu, H Li… - Proceedings of the …, 2017 - dl.acm.org
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …

Anatomy of high-performance GEMM with online fault tolerance on GPUs

S Wu, Y Zhai, J Liu, J Huang, Z Jian, B Wong… - Proceedings of the 37th …, 2023 - dl.acm.org
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as
machine learning and scientific computing since an efficient GEMM implementation is …

Neural network based silent error detector

C Wang, N Dryden, F Cappello… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
As we move toward exascale platforms, silent data corruptions (SDC) are likely to occur
more frequently. Such errors can lead to incorrect results. Attempts have been made to use …