A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org
DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Transparent checkpoint-restart over InfiniBand

J Cao, G Kerr, K Arya, G Cooperman - Proceedings of the 23rd …, 2014 - dl.acm.org
Transparently saving the state of the InfiniBand network as part of distributed checkpointing
has been a long-standing challenge for researchers. The lack of a solution has forced typical …

DAGMap: efficient and dependable scheduling of DAG workflow job in Grid

H Cao, H Jin, X Wu, S Wu, X Shi - The Journal of Supercomputing, 2010 - Springer
DAG has been extensively used in Grid workflow modeling. Since Grid resources tend to be
heterogeneous and dynamic, efficient and dependable workflow job scheduling becomes …

ITALC: interactive tool for application-level checkpointing

R Arora, TN Ba - Proceedings of the Fourth International Workshop on …, 2017 - dl.acm.org
The computational resources at open-science supercomputing centers are shared among
multiple users at a given time, and hence are governed by policies that ensure their fair and …

Object integration in logical database design

R Elmasri, S Navathe - 1984 IEEE First International …, 1984 - ieeexplore.ieee.org
View integration is one of the important phases in logical database design. During this
phase, the individual views designed by separate user groups within the organization are …

Rollback-free recovery for a high performance dense linear solver with reduced memory footprint

D Loreti, M Artioli, A Ciampolini - IEEE Transactions on Parallel …, 2024 - ieeexplore.ieee.org
The scale of nowadays High Performance Computing (HPC) systems is the key element that
determines the achievement of impressive performance, as well as the reason for their …

Discontinuous Incremental: A new approach towards extremely lightweight checkpoints

VH Ha, É Renault - 2011 International Symposium on Computer …, 2011 - ieeexplore.ieee.org
Checkpointing is an important method for providing fault tolerance, load balancing, process
migration, periodic backup, and many other functions. It is also the basic tool used in CAPE …

Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

L Zhu, J Gu, Y Wang, T Zhao, Z Cai - The Journal of Supercomputing, 2015 - Springer
The complexity and scale of high-performance computer systems are rapidly increasing, so
fault tolerance is becoming a critical challenge. In this paper, we consider the impact of …

Linux support for fast transparent general purpose checkpoint/restart of multithreaded processes in loadable kernel module

A Zarrabi, K Samsudin, WA Wan Adnan - Journal of Grid Computing, 2013 - Springer
Checkpoint/Restart is the ability to save the state of a running application so that it can later
resume its execution from the time of the checkpoint. These are techniques with many …