Related articles - Academic Resource Search

Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer

Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Cited by 1 · Related articles · All 2 versions

[PDF] springer.com

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer

Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org

While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Cited by 6 · Related articles · All 2 versions

[PDF] jst.go.jp

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp

With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Cited by 6 · Related articles · All 7 versions

Checkpointing vs. supervision resilience approaches for dynamic independent tasks

J Posner, L Reitz, C Fohry - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org

With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Cited by 5 · Related articles · All 3 versions

[PDF] ens-lyon.fr

Efficient checkpointing of multi-threaded applications as a tool for debugging, performance tuning, and resiliency

M Grossman, V Sarkar - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org

Past work on application checkpointing systems has either focused on enabling application
resiliency or as a tool for debugging (as in record-replay literature). Each of these use cases …

Cited by 7 · Related articles · All 3 versions

[PDF] psu.edu

Application-level checkpointing for shared memory programs

G Bronevetsky, D Marques, K Pingali, P Szwed… - ACM SIGPLAN …, 2004 - dl.acm.org

Trends in high-performance computing are making it necessary for long-running
applications to tolerate hardware faults. The most commonly used approach is checkpoint …

Cited by 177 · Related articles · All 17 versions

[PDF] oelzant.priv.at

[PDF] Linux support for transparent checkpointing of multithreaded programs

CD Carothers, BK Szymanski - To appear in Dr. Dobbs Journal, 2002 - oelzant.priv.at

The most common use of checkpointing is in fault tolerant computing where the goal is to
minimize loss of CPU cycles when a long executing program crashes before completion. By …

Cited by 19 · Related articles · All 7 versions

An application-level checkpointing based on extended data flow analysis for OpenMP programs

HY Fu, Y Ding, W Song, XJ Yang - Jisuanji Xuebao(Chinese Journal of …, 2010 - cjc.ict.ac.cn

As the wide application of multi-core processor architecture in the domain of high
performance computing, fault tolerance for shared memory parallel programs becomes a hot …

Cited by 10 · Related articles

[PDF] illinois.edu

Rebound: scalable checkpointing for coherent shared memory

R Agarwal, P Garg, J Torrellas - Proceedings of the 38th annual …, 2011 - dl.acm.org

As we move to large manycores, the hardware-based global check-pointing schemes that
have been proposed for small shared-memory machines do not scale. Scalability barriers …

Cited by 37 · Related articles · All 14 versions
