Authors
Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K Panda, Hari Subramoni, Jérôme Vienne, Gene Cooperman
Publication date
2016/12/13
Conference
2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)
Pages
932-941
Publisher
IEEE
Description
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores …
Total citations
[Chart: citations per year, 2017 to 2024]
Scholar articles
J Cao, K Arya, R Garg, S Matott, DK Panda… - 2016 IEEE 22nd International Conference on Parallel …, 2016