Authors
Rohan Garg, Jiajun Cao, Kapil Arya, Gene Cooperman, Jérôme Vienne
Publication date
2016/7/17
Book
Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale
Pages
1-8
Description
Batch environments are notoriously unfriendly because it's not easy to interactively diagnose the health of a job. A job may be terminated without warning when it reaches the end of an allotted runtime slot, or it may terminate even sooner due to an unsuspected bug that occurs only at large scale.
Two strategies are proposed that take advantage of DMTCP (Distributed MultiThreaded CheckPointing) for system-level checkpointing. First, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments. Second, we review how to use the SLURM resource manager capability to easily implement extended batch sessions that overcome the typical limitation of 24 hours maximum for a single batch job on large HPC resources. We argue for greater use of this lesser-known capability, as a means to remove the …