FTPA: Supporting fault-tolerant parallel computing through parallel recomputing

X Yang, Y Du, P Wang, H Fu… - IEEE Transactions on …, 2008 - ieeexplore.ieee.org
As the size of large-scale computer systems increases, their mean-time-between-failures are
becoming significantly shorter than the execution time of many current scientific applications …

A new and efficient algorithm-based fault tolerance scheme for a million way parallelism

E Yao, M Chen, R Wang, W Zhang, G Tan - arXiv preprint arXiv:1106.4213, 2011 - arxiv.org
Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. HPC applications typically …

A survey of checkpoint/restart techniques on distributed memory systems

F Shahzad, M Wittmann, M Kreutzer… - Parallel Processing …, 2013 - World Scientific
The road to exascale computing poses many challenges for the High Performance
Computing (HPC) community. Each step on the exascale path is mainly the result of a higher …

[BOOK][B] Scalable techniques for fault tolerant high performance computing

Z Chen - 2006 - search.proquest.com
As the number of processors in today's parallel systems continues to grow, the mean-time-to-
failure of these systems is becoming significantly shorter than the execution time of many …

[PDF][PDF] Portable checkpointing for parallel applications

G Bronevetsky - 2006 - ecommons.cornell.edu
High Performance Computing (HPC) systems represent the peak of modern computational
capability. As ever-increasing demands for computational power have fuelled the demand …

Asynchronous checkpoint migration with mrnet in the scalable checkpoint/restart library

K Mohror, A Moody… - IEEE/IFIP International …, 2012 - ieeexplore.ieee.org
Applications running on today's supercomputers tolerate failures by periodically saving their
state in checkpoint files on stable storage, such as a parallel file system. Although this …

A scalable asynchronous replication-based strategy for fault tolerant MPI applications

JP Walters, V Chaudhary - International Conference on High-Performance …, 2007 - Springer
As computational clusters increase in size, their mean-time-to-failure reduces. Typically
checkpointing is used to minimize the loss of computation. Most checkpointing techniques …

A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI

W Bland, P Du, A Bouteiller, T Herault… - Euro-Par 2012 Parallel …, 2012 - Springer
Most predictions of Exascale machines picture billion way parallelism, encompassing not
only millions of cores, but also tens of thousands of nodes. Even considering extremely …

Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes

I Cores, G Rodríguez, MJ Martín, P González… - New Generation …, 2013 - Springer
The execution times of large-scale parallel applications on nowadays multi/many-core
systems are usually longer than the mean time between failures. Therefore, parallel …

Replication-based fault tolerance for MPI applications

JP Walters, V Chaudhary - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org
As computational clusters increase in size, their mean time to failure reduces drastically.
Typically, checkpointing is used to minimize the loss of computation. Most checkpointing …