Understanding the generalization benefit of normalization layers: Sharpness reduction

K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Abstract: Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they clearly also help …

Decentralized SGD and average-direction SAM are asymptotically equivalent

T Zhu, F He, K Chen, M Song… - … Conference on Machine …, 2023 - proceedings.mlr.press
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on
massive devices simultaneously without the control of a central server. However, existing …
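
For orientation, a minimal NumPy sketch of generic D-SGD with gossip averaging on a ring; the node count, ring topology, mixing weights, quadratic loss, and step size are illustrative assumptions and not the setting analyzed in the paper:

```python
import numpy as np

# Generic decentralized SGD (D-SGD) sketch: each node takes a local stochastic
# gradient step, then averages parameters with its ring neighbors via a doubly
# stochastic mixing matrix W.  All problem details here are placeholder assumptions.

rng = np.random.default_rng(1)
d, n, K, steps, lr = 10, 1000, 5, 200, 0.05

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
shards = np.array_split(np.arange(n), K)           # one data shard per node

# Ring mixing matrix: weight 1/3 on self and on each of the two neighbors.
W = np.zeros((K, K))
for k in range(K):
    W[k, [k, (k - 1) % K, (k + 1) % K]] = 1.0 / 3.0

w = np.zeros((K, d))                                # one parameter vector per node
for t in range(steps):
    grads = np.zeros_like(w)
    for k in range(K):
        i = rng.choice(shards[k])
        grads[k] = (X[i] @ w[k] - y[i]) * X[i]      # local stochastic gradient
    w = W @ (w - lr * grads)                        # local step + gossip averaging

w_avg = w.mean(axis=0)
print("consensus loss:", 0.5 * np.mean((X @ w_avg - y) ** 2))
```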

Why (and When) does Local SGD Generalize Better than SGD?

X Gu, K Lyu, L Huang, S Arora - arXiv preprint arXiv:2303.01215, 2023 - arxiv.org
Local SGD is a communication-efficient variant of SGD for large-scale training, where
multiple GPUs perform SGD independently and average the model parameters periodically …
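
As a rough illustration of the mechanism in this snippet, a minimal NumPy sketch of Local SGD: each worker runs H independent SGD steps, then all workers average their parameters. The worker count, interval H, and quadratic loss are placeholder assumptions, not the paper's setup.

```python
import numpy as np

# Minimal Local SGD sketch (assumed setting: K workers, squared loss, IID data).
# Each worker runs SGD independently for H local steps; parameters are then
# averaged across workers, and the cycle repeats for several rounds.

rng = np.random.default_rng(0)
d, n, K, H, rounds, lr = 10, 1000, 4, 8, 50, 0.05

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
shards = np.array_split(np.arange(n), K)           # one data shard per worker

w = np.zeros(d)                                     # shared model after last sync
for r in range(rounds):
    local = []
    for k in range(K):
        wk = w.copy()
        for _ in range(H):                          # H independent local SGD steps
            i = rng.choice(shards[k])
            grad = (X[i] @ wk - y[i]) * X[i]        # per-example squared-loss gradient
            wk -= lr * grad
        local.append(wk)
    w = np.mean(local, axis=0)                      # periodic parameter averaging

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```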

A Quadratic Synchronization Rule for Distributed Deep Learning

X Gu, K Lyu, S Arora, J Zhang, L Huang - arXiv preprint arXiv:2310.14423, 2023 - arxiv.org
In distributed deep learning with data parallelism, synchronizing gradients at each training
step can cause a huge communication overhead, especially when many nodes work …

Large Deviations Analysis For Regret Minimizing Stochastic Approximation Algorithms

H Qian, V Krishnamurthy - arXiv preprint arXiv:2406.00414, 2024 - arxiv.org
Motivated by learning of correlated equilibria in non-cooperative games, we perform a large
deviations analysis of a regret minimizing stochastic approximation algorithm. The regret …

Implicit Bias of Deep Learning Optimization: A Mathematical Examination

K Lyu - 2024 - search.proquest.com
Deep learning has achieved remarkable success in recent years, yet training neural
networks often involves a delicate combination of guesswork and hyperparameter tuning. A …