Authors
Jiajun Cao, Gregory Kerr, Kapil Arya, Gene Cooperman
Publication date
2014/6/23
Book
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing
Pages
13-24
Description
Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that "tear down" the network, checkpoint each node in isolation, and then re-connect the network again. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C), in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the …
Total citations
Citations per year, 2014 to 2024 (bar chart not reproduced)
Scholar articles
J Cao, G Kerr, K Arya, G Cooperman - Proceedings of the 23rd international symposium on …, 2014