Related articles - Academic resource search

FTPA: Supporting fault-tolerant parallel computing through parallel recomputing

X Yang, Y Du, P Wang, H Fu… - IEEE Transactions on …, 2008 - ieeexplore.ieee.org

As the size of large-scale computer systems increases, their mean-time-between-failures are
becoming significantly shorter than the execution time of many current scientific applications …

Cited by 29 - Related articles - All 4 versions

[PDF] arxiv.org

A new and efficient algorithm-based fault tolerance scheme for a million way parallelism

E Yao, M Chen, R Wang, W Zhang, G Tan - arXiv preprint arXiv:1106.4213, 2011 - arxiv.org

Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. HPC applications typically …

Cited by 8 - Related articles - All 3 versions

[PDF] researchgate.net

A survey of checkpoint/restart techniques on distributed memory systems

F Shahzad, M Wittmann, M Kreutzer… - Parallel Processing …, 2013 - World Scientific

The road to exascale computing poses many challenges for the High Performance
Computing (HPC) community. Each step on the exascale path is mainly the result of a higher …

Cited by 24 - Related articles - All 4 versions

[PDF] tennessee.edu

[BOOK][B] Scalable techniques for fault tolerant high performance computing

Z Chen - 2006 - search.proquest.com

As the number of processors in today's parallel systems continues to grow, the mean-time-to-
failure of these systems is becoming significantly shorter than the execution time of many …

Cited by 16 - Related articles - All 3 versions

[PDF] cornell.edu

[PDF] Portable checkpointing for parallel applications

G Bronevetsky - 2006 - ecommons.cornell.edu

High Performance Computing (HPC) systems represent the peak of modern computational
capability. As ever-increasing demands for computational power have fuelled the demand …

Cited by 8 - Related articles - All 3 versions

[PDF] osti.gov

Asynchronous checkpoint migration with mrnet in the scalable checkpoint/restart library

K Mohror, A Moody… - IEEE/IFIP International …, 2012 - ieeexplore.ieee.org

Applications running on today's supercomputers tolerate failures by periodically saving their
state in checkpoint files on stable storage, such as a parallel file system. Although this …

Cited by 7 - Related articles - All 6 versions

[PDF] psu.edu

A scalable asynchronous replication-based strategy for fault tolerant MPI applications

JP Walters, V Chaudhary - International Conference on High-Performance …, 2007 - Springer

As computational clusters increase in size, their mean-time-to-failure reduces. Typically
checkpointing is used to minimize the loss of computation. Most checkpointing techniques …

Cited by 10 - Related articles - All 14 versions

[PDF] researchgate.net

A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI

W Bland, P Du, A Bouteiller, T Herault… - Euro-Par 2012 Parallel …, 2012 - Springer

Most predictions of Exascale machines picture billion way parallelism, encompassing not
only millions of cores, but also tens of thousands of nodes. Even considering extremely …

Cited by 51 - Related articles - All 17 versions

[PDF] udc.es

Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes

I Cores, G Rodríguez, MJ Martín, P González… - New Generation …, 2013 - Springer

The execution times of large-scale parallel applications on nowadays multi/many-core
systems are usually longer than the mean time between failures. Therefore, parallel …

Cited by 34 - Related articles - All 9 versions

[PDF] psu.edu

Replication-based fault tolerance for MPI applications

JP Walters, V Chaudhary - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org

As computational clusters increase in size, their mean time to failure reduces drastically.
Typically, checkpointing is used to minimize the loss of computation. Most checkpointing …

Cited by 66 - Related articles - All 16 versions
