K Haruki, T Suzuki, Y Hamakawa, T Toda… - arXiv preprint arXiv…, 2019 - arxiv.org
Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency; however, extremely large-batch SGD leads to …