Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th Symposium on Operating Systems Principles, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to the large-scale resources involved and the extended training time. Existing solutions incur significant failure recovery costs because of the severe restriction imposed by the bandwidth of the remote storage in which they store checkpoints.

This paper presents Gemini, a distributed training system that enables fast failure recovery for large model training by checkpointing to the CPU memory of the host machines, which offers much larger aggregate bandwidth. However, two challenges prevent naïvely checkpointing to CPU memory. First, the availability of checkpoints in CPU memory cannot be guaranteed when failures occur. Second, since training and checkpointing traffic share the same network, checkpoint traffic can interfere with training traffic and harm training throughput. To address these two challenges, this paper proposes: 1) a provably near-optimal checkpoint placement strategy that maximizes the probability of failure recovery from checkpoints in CPU memory; and 2) a checkpoint traffic scheduling algorithm that minimizes, if not eliminates, the interference of checkpoint traffic with model training. Our evaluation shows that Gemini achieves failure recovery more than 13× faster than existing solutions. Moreover, it achieves the optimal checkpoint frequency, i.e., checkpointing every iteration, and incurs no overhead on training throughput for large model training.
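The abstract does not spell out the placement strategy, but the intuition behind placing checkpoint replicas in peer CPU memory can be illustrated with a small simulation. In the hypothetical model below, each machine's checkpoint is replicated on a few peers, and recovery from CPU memory succeeds as long as at least one replica of every checkpoint survives a failure. The function names (`group_placement`, `ring_placement`, `recovery_probability`) and the replica-set model are illustrative assumptions, not Gemini's actual API; the sketch only shows why grouping replicas tends to maximize the recovery probability the paper optimizes.

```python
import random

def group_placement(num_machines, group_size):
    """Hypothetical grouped placement: partition machines into disjoint
    groups; each machine's checkpoint is replicated on its group peers."""
    return [set(range(i, i + group_size))
            for i in range(0, num_machines, group_size)]

def ring_placement(num_machines, group_size):
    """Baseline for comparison: machine i replicates to the next
    group_size - 1 machines in a ring, so replica sets overlap."""
    return [{(i + j) % num_machines for j in range(group_size)}
            for i in range(num_machines)]

def recovery_probability(replica_sets, num_machines, num_failures,
                         trials=100_000):
    """Monte Carlo estimate of the probability that every checkpoint keeps
    at least one live replica when num_failures random machines fail."""
    ok = 0
    for _ in range(trials):
        failed = set(random.sample(range(num_machines), num_failures))
        # Recovery succeeds iff no replica set is wholly inside `failed`.
        if all(not rs <= failed for rs in replica_sets):
            ok += 1
    return ok / trials

if __name__ == "__main__":
    N, m, k = 16, 2, 2  # 16 machines, 2 replicas per checkpoint, 2 failures
    for name, place in [("group", group_placement), ("ring", ring_placement)]:
        print(name, recovery_probability(place(N, m), N, k))
```

With 16 machines, 2 replicas, and 2 simultaneous failures, grouped placement loses a checkpoint only when both failures land in the same group (8 of the 120 possible failure pairs, so recovery succeeds ~93% of the time), while the overlapping ring baseline has 16 fatal pairs (~87%). The simulation reflects this gap; the paper's actual strategy and its near-optimality proof go beyond this toy model.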
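For the second challenge, the abstract only states that checkpoint traffic is scheduled so as not to interfere with training traffic. One plausible reading is that a checkpoint is split into chunks that are transmitted during windows when the training network is otherwise idle (e.g., compute phases between gradient synchronizations). The greedy sketch below illustrates that idea under stated assumptions; `idle_windows`, `chunk_sizes`, and `bandwidth` are hypothetical inputs, not Gemini's interface.

```python
def schedule_chunks(idle_windows, chunk_sizes, bandwidth):
    """Greedily pack checkpoint chunks into idle network windows so that
    checkpoint traffic never overlaps training traffic.

    idle_windows: list of (start, end) times when the link is free
    chunk_sizes:  checkpoint chunk sizes in bytes, sent in order
    bandwidth:    link bandwidth in bytes per second
    Returns (send_time, chunk_index) pairs; chunks that do not fit in
    this iteration's windows are deferred to the next iteration.
    """
    schedule, next_chunk = [], 0
    for start, end in idle_windows:
        t = start
        while next_chunk < len(chunk_sizes):
            duration = chunk_sizes[next_chunk] / bandwidth
            if t + duration > end:  # would spill into training traffic
                break
            schedule.append((t, next_chunk))
            t += duration
            next_chunk += 1
    return schedule

# Example: two 11 ms idle windows at 10 GB/s each fit five 20 MB chunks
# (2 ms per chunk); the remaining two chunks are deferred.
print(schedule_chunks([(0.0, 0.011), (0.05, 0.061)], [20e6] * 12, 10e9))
```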