A Quadratic Synchronization Rule for Distributed Deep Learning

X Gu, K Lyu, S Arora, J Zhang, L Huang - arXiv preprint arXiv:2310.14423, 2023 - arxiv.org
In distributed deep learning with data parallelism, synchronizing gradients at each training
step can incur huge communication overhead, especially when many nodes work …