SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

Implicit bias of the step size in linear diagonal neural networks

MS Nacson, K Ravichandran… - International …, 2022 - proceedings.mlr.press
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …
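
As a point of reference for this entry, a minimal sketch of the model it names (this is the standard parameterization from the diagonal-network literature and is an assumption here, since the snippet does not spell it out): a diagonal linear network predicts

    f_{u,v}(x) = \langle u \odot v,\, x \rangle, \qquad u, v \in \mathbb{R}^d,

and is trained by gradient descent on a squared loss over underdetermined data (fewer samples than d), so many interpolating solutions exist and the step size can influence which of them the iterates converge to.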

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …

On the maximum Hessian eigenvalue and generalization

S Kaur, J Cohen, ZC Lipton - Proceedings on …, 2023 - proceedings.mlr.press
The mechanisms by which certain training interventions, such as increasing learning rates
and applying batch normalization, improve the generalization of deep networks remain a …

Spectral evolution and invariance in linear-width neural networks

Z Wang, A Engel, AD Sarwate… - Advances in Neural …, 2023 - proceedings.neurips.cc
We investigate the spectral properties of linear-width feed-forward neural networks, where
the sample size is asymptotically proportional to network width. Empirically, we show that the …

Direction matters: On the implicit bias of stochastic gradient descent with moderate learning rate

J Wu, D Zou, V Braverman, Q Gu - arXiv preprint arXiv:2011.02538, 2020 - arxiv.org
Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the
key challenges in modern machine learning and deep learning theory. Most of the existing …

On the benefits of large learning rates for kernel methods

G Beugnot, J Mairal, A Rudi - Conference on Learning …, 2022 - proceedings.mlr.press
This paper studies an intriguing phenomenon related to the good generalization
performance of estimators obtained by using large learning rates within gradient descent …

Robust recovery via implicit bias of discrepant learning rates for double over-parameterization

C You, Z Zhu, Q Qu, Y Ma - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Recent advances have shown that the implicit bias of gradient descent on over-parameterized
models enables the recovery of low-rank matrices from linear measurements, even with no …
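
A hedged reading of the setup suggested by this title and snippet (the symbols below are illustrative assumptions, not notation quoted from the paper): corrupted linear measurements y = \mathcal{A}(X) + s are fit with both components over-parameterized, for example

    X = U U^\top, \qquad s = g \odot g - h \odot h,

and gradient descent is run with a different ("discrepant") step size on the low-rank block than on the corruption block, the idea being that the implicit bias of this procedure separates a low-rank X from a sparse s without explicit regularization.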

Learning associative memories with gradient descent

V Cabannes, B Simsek, A Bietti - arXiv preprint arXiv:2402.18724, 2024 - arxiv.org
This work focuses on the training dynamics of one associative memory module storing outer
products of token embeddings. We reduce this problem to the study of a system of particles …
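
As a hedged illustration of the object this entry names (notation assumed for exposition, not taken from the paper): an associative memory module of this kind stores token pairs (x_i, y_i) in a single weight matrix built from outer products of their embeddings,

    W = \sum_i \alpha_i \, u_{y_i} e_{x_i}^\top,

so that a query embedding e_x is read out through W e_x, and the training-dynamics question is how gradient descent on a classification loss shapes the coefficients \alpha_i, i.e. which associations get stored and how strongly.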

An empirical study of pre-trained vision models on out-of-distribution generalization

Y Yu, H Jiang, D Bahri, H Mobahi, S Kim… - … 2021 Workshop on …, 2021 - openreview.net
Generalizing to out-of-distribution (OOD) data (that is, data from domains unseen during
training) is a key challenge in modern machine learning, which has only recently received …