Authors
Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, Yida Wang
Publication date
2023/10/23
Book
Proceedings of the 29th Symposium on Operating Systems Principles
Pages
364-381
Description
Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to the large-scale resources involved and the extended training time. Existing solutions have significant failure recovery costs due to the severe restriction imposed by the bandwidth of the remote storage in which they store checkpoints.
This paper presents Gemini, a distributed training system that enables fast failure recovery for large model training by checkpointing to CPU memory of the host machines with much larger aggregated bandwidth. However, two challenges prevent naïvely checkpointing to CPU memory. First, the availability of checkpoints in CPU memory cannot be guaranteed when failures occur. Second, since the communication traffic for training and checkpointing share the same network, checkpoint traffic can interfere with …