Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Authors

Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, Yida Wang

Publication date

2023/10/23

Book

Proceedings of the 29th Symposium on Operating Systems Principles

Pages

364-381

Description

Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to large-scale resources involved and extended training time. Existing solutions have significant failure recovery costs due to the severe restriction imposed by the bandwidth of remote storage in which they store checkpoints.

This paper presents Gemini, a distributed training system that enables fast failure recovery for large model training by checkpointing to CPU memory of the host machines with much larger aggregated bandwidth. However, two challenges prevent naïvely checkpointing to CPU memory. First, the availability of checkpoints in CPU memory cannot be guaranteed when failures occur. Second, since the communication traffic for training and checkpointing share the same network, checkpoint traffic can interfere with …

Total citations

Cited by 15

Citations per year: 2023: 4, 2024: 11
