CHPOX: Transparent checkpointing system for Linux clusters

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer

In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Cited by 346 · Related articles · All 12 versions

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org

DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Cited by 469 · Related articles · All 24 versions

Transparent checkpoint-restart over InfiniBand

J Cao, G Kerr, K Arya, G Cooperman - Proceedings of the 23rd …, 2014 - dl.acm.org

Transparently saving the state of the InfiniBand network as part of distributed checkpointing
has been a long-standing challenge for researchers. The lack of a solution has forced typical …

Cited by 40 · Related articles · All 10 versions

DAGMap: efficient and dependable scheduling of DAG workflow job in Grid

H Cao, H Jin, X Wu, S Wu, X Shi - The Journal of Supercomputing, 2010 - Springer

DAG has been extensively used in Grid workflow modeling. Since Grid resources tend to be
heterogeneous and dynamic, efficient and dependable workflow job scheduling becomes …

Cited by 52 · Related articles · All 11 versions

ITALC: interactive tool for application-level checkpointing

R Arora, TN Ba - Proceedings of the Fourth International Workshop on …, 2017 - dl.acm.org

The computational resources at open-science supercomputing centers are shared among
multiple users at a given time, and hence are governed by policies that ensure their fair and …

Cited by 17 · Related articles

Object integration in logical database design

R Elmasri, S Navathe - 1984 IEEE First International …, 1984 - ieeexplore.ieee.org

View integration is one of the important phases in logical database design. During this
phase, the individual views designed by separate user groups within the organization are …

Cited by 72 · Related articles · All 5 versions

Rollback-free recovery for a high performance dense linear solver with reduced memory footprint

D Loreti, M Artioli, A Ciampolini - IEEE Transactions on Parallel …, 2024 - ieeexplore.ieee.org

The scale of nowadays High Performance Computing (HPC) systems is the key element that
determines the achievement of impressive performance, as well as the reason for their …

Cited by 1 · Related articles · All 3 versions

Discontinuous Incremental: A new approach towards extremely lightweight checkpoints

VH Ha, É Renault - 2011 International Symposium on Computer …, 2011 - ieeexplore.ieee.org

Checkpointing is an important method for providing fault tolerance, load balancing, process
migration, periodic backup, and many other functions. It is also the basic tool used in CAPE …

Cited by 21 · Related articles · All 9 versions

Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

L Zhu, J Gu, Y Wang, T Zhao, Z Cai - The Journal of Supercomputing, 2015 - Springer

The complexity and scale of high-performance computer systems are rapidly increasing, so
fault tolerance is becoming a critical challenge. In this paper, we consider the impact of …

Cited by 10 · Related articles · All 6 versions

Linux support for fast transparent general purpose checkpoint/restart of multithreaded processes in loadable kernel module

A Zarrabi, K Samsudin, WA Wan Adnan - Journal of Grid Computing, 2013 - Springer

Checkpoint/Restart is the ability to save the state of a running application so that it can later
resume its execution from the time of the checkpoint. These are techniques with many …

Cited by 10 · Related articles · All 7 versions
