Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolera...

H Li, Z Wang, H Zhao, M Zhang, X Li, H Xu - Future Generation Computer …, 2025 - Elsevier

Training Deep Learning (DL) models is becoming more time-consuming, thus
interruptions to the training processes are inevitable. We can obtain an optimal …

Phoenix: A substrate for resilient distributed graph analytics

R Dathathri, G Gill, L Hoang, K Pingali - Proceedings of the Twenty …, 2019 - dl.acm.org

This paper presents Phoenix, a communication and synchronization substrate that
implements a novel protocol for recovering from fail-stop faults when executing graph …

Cited by 15 · Related articles · All 5 versions

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org

Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Cited by 14 · Related articles · All 2 versions

A vision of post-exascale programming

JD Zhai, WG Chen - Frontiers of Information Technology & Electronic …, 2018 - Springer

Exascale systems have been under development for quite some time and will be available
for use in a few years. It is time to think about future post-exascale systems. There are many …

Cited by 10 · Related articles · All 4 versions

Algorithm-directed crash consistence in non-volatile memory for hpc

S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile
memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …

Cited by 17 · Related articles · All 7 versions

Co-designing multi-level checkpoint restart for mpi applications

K Parasyris, G Georgakoudis… - 2021 IEEE/ACM 21st …, 2021 - ieeexplore.ieee.org

HPC systems continue to scale by including more hardware components for supporting
larger application deployments. Critically, this scaling tends to decrease the mean time …

Cited by 4 · Related articles · All 2 versions

Scaling and resilience in numerical algorithms for exascale computing

AS Nielsen - 2018 - infoscience.epfl.ch

The first Petascale supercomputer, the IBM Roadrunner, went online in 2008. Ten years
later, the community is now looking ahead to a new generation of Exascale machines …

Cited by 4 · Related articles · All 4 versions

High performance data persistence in non-volatile memory for resilient high performance computing

Y Huang, K Wu, D Li - arXiv preprint arXiv:1705.00264, 2017 - arxiv.org

Resilience is a major design goal for HPC. Checkpoint is the most common method to
enable resilient HPC. Checkpoint periodically saves critical data objects to non-volatile …

Cited by 3 · Related articles · All 2 versions

Réplication de données pour la tolérance aux pannes dans un support d'exécution distribué à base de tâches

R Lion - 2022 - theses.hal.science

As the computing power of new supercomputers increases, their
reliability inexorably decreases. Indeed, the limits are pushed back by increasing the …

Mitigating I/O impact of checkpointing on large scale parallel systems

N Wang, Q Sun, Y Liu, D Qian - 2018 IEEE 20th International …, 2018 - ieeexplore.ieee.org

Checkpointing is the most widely used technique in high performance computing systems to
tolerate fail-stop errors and ensure reliable execution of parallel applications. However, with …

Cited by 2 · Related articles · All 2 versions
