SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

Implicit bias of the step size in linear diagonal neural networks

MS Nacson, K Ravichandran… - International …, 2022 - proceedings.mlr.press
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …
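
As a point of reference for this entry, a minimal sketch of the model it names (this is the standard parameterization from the diagonal-network literature and is an assumption here, since the snippet does not spell it out): a diagonal linear network predicts

    f_{u,v}(x) = \langle u \odot v,\, x \rangle, \qquad u, v \in \mathbb{R}^d,

and is trained by gradient descent on a squared loss over underdetermined data (fewer samples than d), so many interpolating solutions exist and the step size can influence which of them the iterates converge to.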

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …

On the maximum Hessian eigenvalue and generalization

S Kaur, J Cohen, ZC Lipton - Proceedings on …, 2023 - proceedings.mlr.press
The mechanisms by which certain training interventions, such as increasing learning rates
and applying batch normalization, improve the generalization of deep networks remain a …

Spectral evolution and invariance in linear-width neural networks

Z Wang, A Engel, AD Sarwate… - Advances in Neural …, 2023 - proceedings.neurips.cc
We investigate the spectral properties of linear-width feed-forward neural networks, where
the sample size is asymptotically proportional to network width. Empirically, we show that the …

Direction matters: On the implicit bias of stochastic gradient descent with moderate learning rate

J Wu, D Zou, V Braverman, Q Gu - arXiv preprint arXiv:2011.02538, 2020 - arxiv.org
Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the
key challenges in modern machine learning and deep learning theory. Most of the existing …

On the benefits of large learning rates for kernel methods

G Beugnot, J Mairal, A Rudi - Conference on Learning …, 2022 - proceedings.mlr.press
This paper studies an intriguing phenomenon related to the good generalization
performance of estimators obtained by using large learning rates within gradient descent …

Robust recovery via implicit bias of discrepant learning rates for double over-parameterization

C You, Z Zhu, Q Qu, Y Ma - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Recent advances have shown that the implicit bias of gradient descent on over-parameterized
models enables the recovery of low-rank matrices from linear measurements, even with no …
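
A hedged reading of the setup suggested by this title and snippet (the symbols below are illustrative assumptions, not notation quoted from the paper): corrupted linear measurements y = \mathcal{A}(X) + s are fit with both components over-parameterized, for example

    X = U U^\top, \qquad s = g \odot g - h \odot h,

and gradient descent is run with a different ("discrepant") step size on the low-rank block than on the corruption block, the idea being that the implicit bias of this procedure separates a low-rank X from a sparse s without explicit regularization.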

Learning associative memories with gradient descent

V Cabannes, B Simsek, A Bietti - arXiv preprint arXiv:2402.18724, 2024 - arxiv.org
This work focuses on the training dynamics of one associative memory module storing outer
products of token embeddings. We reduce this problem to the study of a system of particles …
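
As a hedged illustration of the object this entry names (notation assumed for exposition, not taken from the paper): an associative memory module of this kind stores token pairs (x_i, y_i) in a single weight matrix built from outer products of their embeddings,

    W = \sum_i \alpha_i \, u_{y_i} e_{x_i}^\top,

so that a query embedding e_x is read out through W e_x, and the training-dynamics question is how gradient descent on a classification loss shapes the coefficients \alpha_i, i.e. which associations get stored and how strongly.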

An empirical study of pre-trained vision models on out-of-distribution generalization

Y Yu, H Jiang, D Bahri, H Mobahi, S Kim… - … 2021 Workshop on …, 2021 - openreview.net
Generalizing to out-of-distribution (OOD) data (that is, data from domains unseen during
training) is a key challenge in modern machine learning, which has only recently received …