A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence

N Baker, F Alexander, T Bremer, A Hagberg… - 2019 - osti.gov
Scientific Machine Learning (SciML) and Artificial Intelligence (AI) will have broad use and
transformative effects across the Department of Energy. Accordingly, the January 2018 Basic …

Big data analytics: Machine learning and Bayesian learning perspectives—What is done? What is not?

S Suthaharan - Wiley Interdisciplinary Reviews: Data Mining …, 2019 - Wiley Online Library
Big data analytics provides an interdisciplinary framework that is essential to support the
current trend for solving real‐world problems collaboratively. The progression of big data …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Characterization of the impact of soft errors on iterative methods

BO Mutlu, G Kestor, J Manzano, O Unsal… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Soft errors caused by transient bit flips have the potential to significantly impact an
application's behavior. This has motivated the design of an array of techniques to detect …

Towards end-to-end SDC detection for HPC applications equipped with lossy compression

S Li, S Di, K Zhao, X Liang, Z Chen… - … Conference on Cluster …, 2020 - ieeexplore.ieee.org
Data reduction techniques have been widely demanded and used by large-scale high
performance computing (HPC) applications because of vast volumes of data to be produced …

Predicting the silent data corruption vulnerability of instructions in programs

N Yang, Y Wang - 2019 IEEE 25th International Conference on …, 2019 - ieeexplore.ieee.org
With the decreasing size and voltage level of internal device components, soft errors are
increasing and constitute a major threat on electronic devices. Silent data corruption (SDC) …

Ground-truth prediction to accelerate soft-error impact analysis for iterative methods

BO Mutlu, G Kestor, A Cristal, O Unsal… - 2019 IEEE 26th …, 2019 - ieeexplore.ieee.org
Understanding the impact of soft errors on applications can be expensive. Often, it requires
an extensive error injection campaign involving numerous runs of the full application in the …

Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

G Zhang, Y Liu, H Yang, D Qian - The Journal of Supercomputing, 2022 - Springer
Nowadays, high-performance computing (HPC) is stepping forward to exascale era.
However, silent data corruption (SDC) behaved as bit-flipping can cause disastrous …

FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

A Das, S Krishnamoorthy, I Briggs… - ACM Transactions on …, 2020 - dl.acm.org
We present FPDetect, a low-overhead approach for detecting logical errors and soft errors
affecting stencil computations without generating false positives. We develop an offline …