Assuming failure independence: are we right to be wrong?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

Abstract The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Cited by 1 · Related articles · All 3 versions

[PDF] osti.gov

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org

Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Cited by 43 · Related articles · All 6 versions

[PDF] hal.science

Checkpointing à la Young/Daly: an overview

A Benoit, Y Du, T Herault, L Marchal, G Pallez… - Proceedings of the …, 2022 - dl.acm.org

The Young/Daly formula provides an approximation of the optimal checkpoint period for a
parallel application executing on a supercomputing platform. The Young/Daly formula was …

Cited by 3 · Related articles · All 14 versions

[PDF] salkhordeh.de

Improving checkpointing intervals by considering individual job failure probabilities

A Frank, M Baumgartner, R Salkhordeh… - 2021 IEEE …, 2021 - ieeexplore.ieee.org

Checkpointing is a popular resilience method in HPC and its efficiency highly depends on
the choice of the checkpoint interval. Standard analytical approaches optimize intervals for …

Cited by 13 · Related articles · All 5 versions

[PDF] acm.org

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Cited by 18 · Related articles · All 17 versions

[PDF] hal.science

Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

A Benoit, L Perotin, Y Robert, F Vivien - ACM Transactions on Parallel …, 2024 - dl.acm.org

This article studies checkpointing strategies for parallel applications subject to failures. The
optimal strategy to minimize total execution time, or makespan, is well known when failure …

Cited by 2 · Related articles · All 5 versions

[PDF] uni-mainz.de

Reducing resource waste in HPC through co-allocation, custom checkpoints, and lower false failure prediction rates

A Frank - 2022 - openscience.ub.uni-mainz.de

Bigger systems are being deployed by High Performance Computing centers in order to
fulfill the needs of modern scientific and big data applications as well as to match the …

Cited by 1 · Related articles · All 2 versions

[PDF] hal.science

Online fault tolerant task scheduling for real-time multiprocessor embedded systems

P Dobiáš - 2020 - hal.science

The thesis is concerned with online mapping and scheduling of tasks on multiprocessor
embedded systems in order to improve the reliability subject to various constraints regarding …

Cited by 3 · Related articles · All 4 versions

[PDF] hal.science

Checkpointing strategies to protect parallel jobs from non-memoryless fail-stop errors

A Benoit, L Perotin, Y Robert, F Vivien - 2022 - inria.hal.science

This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The
optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it …

Cited by 2 · Related articles · All 7 versions

[PDF] hal.science

Scheduling algorithms to optimize the performance, energy consumption and robustness of HPC applications

L Perotin - 2023 - theses.hal.science

This thesis addresses the problem of resilience in large-scale computer systems. Due to the
rapid development of high-performance computing technology, it has become crucial to …
