X Lian, SA Jacobs, L Kurilenko, M Tanaka… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing checkpointing approaches seem ill-suited for distributed training even though
hardware limitations make model parallelism, i.e., sharding model state across multiple …