An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org
Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

A scalable double in-memory checkpoint and restart scheme towards exascale

G Zheng, X Ni, LV Kalé - IEEE/IFIP International Conference on …, 2012 - ieeexplore.ieee.org
As the size of supercomputers increases, the probability of system failure grows
substantially, posing an increasingly significant challenge for scalability. It is important to …

Asynchronous checkpointing by dedicated checkpoint threads

F Shahzad, M Wittmann, T Zeiser, G Wellein - Recent Advances in the …, 2012 - Springer
Checkpoint/restart (C/R) is a classical approach to introduce fault tolerance in large
HPC applications. Although it is relatively easy as compared to other fault tolerance …

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

G Zheng, L Shi, LV Kalé - … on Cluster Computing (IEEE Cat. No …, 2004 - ieeexplore.ieee.org
As high performance clusters continue to grow in size, the mean time between failures
shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the …

A case study of incremental and background hybrid in-memory checkpointing

X Dong, Y Xie, N Muralimanohar… - Proc. of the 2010 Exascale …, 2010 - shiftleft.com
Future exascale computing systems will have high failure rates due to the sheer number of
components present in the system. A classic fault-tolerance technique used in today's …

libhashckpt: hash-based incremental checkpointing using GPU's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer
Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Low-overhead diskless checkpoint for hybrid computing systems

LB Gomez, A Nukada, N Maruyama… - … Conference on High …, 2010 - ieeexplore.ieee.org
As the size of new supercomputers scales to tens of thousands of sockets, the mean time
between failures (MTBF) is decreasing to just several hours and long executions need some …

Techniques for efficient in-memory checkpointing

D Vogt, C Giuffrida, H Bos, AS Tanenbaum - ACM SIGOPS Operating …, 2014 - dl.acm.org
Checkpointing is a pivotal technique in system research, with applications ranging from
crash recovery to replay debugging. In this paper, we evaluate a number of in-memory …

Reducing checkpoint creation overhead using data similarity

A Kongmunvattana - Int J Comput, 2015 - meacse.org
Checkpoint/restart is a common technique deployed in the high-performance computing
(HPC) systems to provide a fault-tolerant capability. The most widely deployed …