Understanding the generalization benefit of normalization layers: Sharpness reduction

K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Abstract: Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they clearly also help …

Decentralized SGD and average-direction SAM are asymptotically equivalent

T Zhu, F He, K Chen, M Song… - … Conference on Machine …, 2023 - proceedings.mlr.press
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on
massive devices simultaneously without the control of a central server. However, existing …
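
For orientation, a minimal NumPy sketch of generic D-SGD with gossip averaging on a ring; the node count, ring topology, mixing weights, quadratic loss, and step size are illustrative assumptions and not the setting analyzed in the paper:

```python
import numpy as np

# Generic decentralized SGD (D-SGD) sketch: each node takes a local stochastic
# gradient step, then averages parameters with its ring neighbors via a doubly
# stochastic mixing matrix W.  All problem details here are placeholder assumptions.

rng = np.random.default_rng(1)
d, n, K, steps, lr = 10, 1000, 5, 200, 0.05

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
shards = np.array_split(np.arange(n), K)           # one data shard per node

# Ring mixing matrix: weight 1/3 on self and on each of the two neighbors.
W = np.zeros((K, K))
for k in range(K):
    W[k, [k, (k - 1) % K, (k + 1) % K]] = 1.0 / 3.0

w = np.zeros((K, d))                                # one parameter vector per node
for t in range(steps):
    grads = np.zeros_like(w)
    for k in range(K):
        i = rng.choice(shards[k])
        grads[k] = (X[i] @ w[k] - y[i]) * X[i]      # local stochastic gradient
    w = W @ (w - lr * grads)                        # local step + gossip averaging

w_avg = w.mean(axis=0)
print("consensus loss:", 0.5 * np.mean((X @ w_avg - y) ** 2))
```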

Why (and When) does Local SGD Generalize Better than SGD?

X Gu, K Lyu, L Huang, S Arora - arXiv preprint arXiv:2303.01215, 2023 - arxiv.org
Local SGD is a communication-efficient variant of SGD for large-scale training, where
multiple GPUs perform SGD independently and average the model parameters periodically …
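
As a rough illustration of the mechanism in this snippet, a minimal NumPy sketch of Local SGD: each worker runs H independent SGD steps, then all workers average their parameters. The worker count, interval H, and quadratic loss are placeholder assumptions, not the paper's setup.

```python
import numpy as np

# Minimal Local SGD sketch (assumed setting: K workers, squared loss, IID data).
# Each worker runs SGD independently for H local steps; parameters are then
# averaged across workers, and the cycle repeats for several rounds.

rng = np.random.default_rng(0)
d, n, K, H, rounds, lr = 10, 1000, 4, 8, 50, 0.05

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
shards = np.array_split(np.arange(n), K)           # one data shard per worker

w = np.zeros(d)                                     # shared model after last sync
for r in range(rounds):
    local = []
    for k in range(K):
        wk = w.copy()
        for _ in range(H):                          # H independent local SGD steps
            i = rng.choice(shards[k])
            grad = (X[i] @ wk - y[i]) * X[i]        # per-example squared-loss gradient
            wk -= lr * grad
        local.append(wk)
    w = np.mean(local, axis=0)                      # periodic parameter averaging

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```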

A Quadratic Synchronization Rule for Distributed Deep Learning

X Gu, K Lyu, S Arora, J Zhang, L Huang - arXiv preprint arXiv:2310.14423, 2023 - arxiv.org
In distributed deep learning with data parallelism, synchronizing gradients at each training
step can cause a huge communication overhead, especially when many nodes work …

Large Deviations Analysis For Regret Minimizing Stochastic Approximation Algorithms

H Qian, V Krishnamurthy - arXiv preprint arXiv:2406.00414, 2024 - arxiv.org
Motivated by learning of correlated equilibria in non-cooperative games, we perform a large
deviations analysis of a regret minimizing stochastic approximation algorithm. The regret …

Implicit Bias of Deep Learning Optimization: A Mathematical Examination

K Lyu - 2024 - search.proquest.com
Deep learning has achieved remarkable success in recent years, yet training neural
networks often involves a delicate combination of guesswork and hyperparameter tuning. A …