Extended Batch Sessions and Three-Phase Debugging: Using DMTCP to Enhance the Batch Environment

R Garg, J Cao, K Arya, G Cooperman… - Proceedings of the …, 2016 - dl.acm.org
Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at …, 2016 - dl.acm.org
Batch environments are notoriously unfriendly because it's not easy to interactively diagnose the health of a job. A job may be terminated without warning when it reaches the end of an allotted runtime slot, or it may terminate even sooner due to an unsuspected bug that occurs only at large scale.
Two strategies are proposed that take advantage of DMTCP (Distributed MultiThreaded CheckPointing) for system-level checkpointing. First, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments. Second, we review how to use the SLURM resource manager capability to easily implement extended batch sessions that overcome the typical 24-hour limit on a single batch job on large HPC resources. We argue for greater use of this lesser-known capability as a way to remove the need for the application-specific checkpointing found in many long-running jobs.
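
To make the extended-batch-session idea concrete, here is a minimal planning sketch in Python. It is not the authors' method. It only estimates how many chained batch jobs a long run needs when each job is capped (24 hours by default) and some time at the end of each job is reserved for writing a checkpoint. The 0.5-hour margin, the script names `launch.sh` and `restart.sh`, and the `<JOBID_n>` placeholders are all assumptions made for illustration. Restart overhead is ignored. The generated command lines use `sbatch --dependency=afterany:...`, which asks SLURM to start a job after an earlier one ends in any state. Nothing is submitted.

```python
import math


def plan_segments(total_hours, limit_hours=24.0, margin_hours=0.5):
    """Estimate how many chained batch jobs a run of total_hours needs.

    Each job may run for limit_hours, of which margin_hours is reserved
    (an assumed value) for writing a checkpoint before the job ends.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if margin_hours < 0 or margin_hours >= limit_hours:
        raise ValueError("margin_hours must be in [0, limit_hours)")
    useful_hours = limit_hours - margin_hours
    return math.ceil(total_hours / useful_hours)


def chained_commands(n_segments,
                     first_script="launch.sh",      # hypothetical script name
                     restart_script="restart.sh"):  # hypothetical script name
    """Build (but do not run) one sbatch command line per segment.

    The first job starts the application under checkpoint control; each
    later job restarts from the latest checkpoint once the previous job
    has ended. <JOBID_n> stands for the ID that SLURM reports for job n.
    """
    commands = []
    for i in range(n_segments):
        if i == 0:
            commands.append(["sbatch", first_script])
        else:
            dependency = "--dependency=afterany:<JOBID_%d>" % (i - 1)
            commands.append(["sbatch", dependency, restart_script])
    return commands


if __name__ == "__main__":
    n = plan_segments(100.0)
    for cmd in chained_commands(n):
        print(" ".join(cmd))
```

In a real workflow the first script would typically start the application with `dmtcp_launch` and the restart script would resume it with `dmtcp_restart`. The exact options depend on the site and the DMTCP version, so they are left out here.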