Convergence-aware optimal checkpointing for exploratory deep learning training jobs

H Li, Z Wang, H Zhao, M Zhang, X Li, H Xu - Future Generation Computer …, 2025 - Elsevier
Training Deep Learning (DL) models is becoming more time-consuming, thus
interruptions to the training processes are inevitable. We can obtain an optimal …

Phoenix: A substrate for resilient distributed graph analytics

R Dathathri, G Gill, L Hoang, K Pingali - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents Phoenix, a communication and synchronization substrate that
implements a novel protocol for recovering from fail-stop faults when executing graph …

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

A vision of post-exascale programming

JD Zhai, WG Chen - Frontiers of Information Technology & Electronic …, 2018 - Springer
Exascale systems have been under development for quite some time and will be available
for use in a few years. It is time to think about future post-exascale systems. There are many …

Algorithm-directed crash consistence in non-volatile memory for HPC

S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile
memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …

Co-designing multi-level checkpoint restart for MPI applications

K Parasyris, G Georgakoudis… - 2021 IEEE/ACM 21st …, 2021 - ieeexplore.ieee.org
HPC systems continue to scale by including more hardware components for supporting
larger application deployments. Critically, this scaling tends to decrease the mean time …

Scaling and resilience in numerical algorithms for exascale computing

AS Nielsen - 2018 - infoscience.epfl.ch
The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years
later, the community is now looking ahead to a new generation of Exascale machines …

High performance data persistence in non-volatile memory for resilient high performance computing

Y Huang, K Wu, D Li - arXiv preprint arXiv:1705.00264, 2017 - arxiv.org
Resilience is a major design goal for HPC. Checkpoint is the most common method to
enable resilient HPC. Checkpoint periodically saves critical data objects to non-volatile …

Réplication de données pour la tolérance aux pannes dans un support d'exécution distribué à base de tâches [Data replication for fault tolerance in a task-based distributed runtime]

R Lion - 2022 - theses.hal.science
As the computing power of new supercomputers increases, their
reliability inexorably decreases. Indeed, the limits are pushed back by increasing the …

Mitigating I/O impact of checkpointing on large scale parallel systems

N Wang, Q Sun, Y Liu, D Qian - 2018 IEEE 20th International …, 2018 - ieeexplore.ieee.org
Checkpointing is the most widely used technique in high performance computing systems to
tolerate fail-stop errors and ensure reliable execution of parallel applications. However, with …