X Liu, L Zeng - … on Networking, Architecture and Storage (NAS), 2024 - ieeexplore.ieee.org
When running deep learning training jobs, in order to prevent training loss due to
software/hardware failures, a checkpointing mechanism is usually used to periodically store …