Building algorithmically nonstop fault tolerant MPI programs

Authors

Rui Wang, Erlin Yao, Mingyu Chen, Guangming Tan, Pavan Balaji, Darius Buntinas

Publication date

2011/12/18

Conference

2011 18th International Conference on High Performance Computing

Pages

1-9

Publisher

IEEE

Description

With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are a norm rather than an exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, where even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When failure occurs during the execution of applications, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the …

Total citations

Cited by 29

[Chart: citations per year, x-axis 2010 to 2023; nonzero bar values in order: 2, 1, 1, 4, 1, 2, 3, 4, 4, 2, 1, 2, 2]
