Focusing on diagonal linear networks as a model for understanding the implicit bias in underdetermined models, we show how the gradient descent step size can have a large …
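A minimal numpy sketch of the model class this snippet names: a diagonal linear network reparameterizes a linear predictor as w = u ⊙ v and runs plain gradient descent on (u, v) for an underdetermined least-squares problem. The dimensions, step sizes, and sparsity level below are illustrative assumptions, not taken from the paper; the printout simply lets you compare the interpolators reached with two different step sizes.

```python
import numpy as np

# Hypothetical setup: underdetermined regression, n samples < d features.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:5] = 1.0                              # sparse ground truth (assumed)
y = X @ w_star

def train_diagonal_net(step, alpha=1e-2, iters=30_000):
    """Gradient descent on the diagonal parameterization w = u * v."""
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(iters):
        g = X.T @ (X @ (u * v) - y) / n       # gradient of the loss w.r.t. w = u * v
        u, v = u - step * g * v, v - step * g * u
    return u * v

for step in (0.01, 0.05):                     # two step sizes, same initialization
    w = train_diagonal_net(step)
    print(f"step {step}: train err {np.linalg.norm(X @ w - y):.1e}, "
          f"||w||_1 = {np.abs(w).sum():.3f}, dist to w* = {np.linalg.norm(w - w_star):.3f}")
```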
Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and …
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a …
We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the …
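A small numpy sketch of what the "linear-width" (proportional) regime referred to here looks like: sample size n and hidden width m grow at a fixed ratio, and one inspects the empirical eigenvalue spectrum of the post-activation feature Gram matrix of a random one-hidden-layer ReLU network at initialization. The sizes and the ReLU choice are assumptions for illustration, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 1000, 500, 1500                       # n/m held fixed as both grow (assumed ratio)
X = rng.standard_normal((n, d)) / np.sqrt(d)    # inputs
W = rng.standard_normal((d, m))                 # first-layer weights at initialization
F = np.maximum(X @ W, 0.0)                      # hidden-layer features (ReLU)
K = F @ F.T / m                                 # feature Gram ("conjugate kernel") matrix
eig = np.linalg.eigvalsh(K)
print("largest 3 eigenvalues:", np.round(eig[-3:], 3))
print("smallest 3 eigenvalues:", np.round(eig[:3], 4))
```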
Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing …
This paper studies an intriguing phenomenon related to the strong generalization of estimators obtained by using large learning rates in gradient descent …
C You, Z Zhu, Q Qu, Y Ma - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Recent advances have shown that implicit bias of gradient descent on over-parameterized models enables the recovery of low-rank matrices from linear measurements, even with no …
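A hedged numpy sketch of the phenomenon this snippet describes: gradient descent on an unconstrained factorization M = U Uᵀ, fit to random linear measurements of a low-rank PSD matrix, with small initialization as the ingredient usually credited with the implicit low-rank bias. The dimensions, measurement count, and step size are assumptions chosen for a quick run, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n_meas = 30, 2, 300
B = rng.standard_normal((d, r))
M_star = B @ B.T
M_star /= np.linalg.norm(M_star, 2)                        # rank-2 target, spectral norm 1
A = rng.standard_normal((n_meas, d, d)) / np.sqrt(n_meas)  # random measurement matrices
y = np.einsum("kij,ij->k", A, M_star)                      # noiseless measurements

alpha, eta, iters = 1e-3, 0.05, 5000
U = alpha * rng.standard_normal((d, d))                    # full-rank factor, no rank constraint
for _ in range(iters):
    R = np.einsum("kij,ij->k", A, U @ U.T) - y             # measurement residuals
    G = np.einsum("k,kij->ij", R, A)                       # gradient w.r.t. the product U U^T
    U = U - eta * (G + G.T) @ U                            # gradient step on the factor
M_hat = U @ U.T
s = np.linalg.svd(M_hat, compute_uv=False)
print("relative error:", np.linalg.norm(M_hat - M_star) / np.linalg.norm(M_star))
print("singular values > 1e-2:", int((s > 1e-2).sum()))    # effective rank typically stays near r
```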
This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles …
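A brief numpy sketch of the object named here, an associative memory that stores key-value associations as a sum of outer products of token embeddings and retrieves them by a matrix-vector product followed by nearest-neighbor decoding. The vocabulary size, dimensions, and the Hebbian-style storage rule are illustrative assumptions; the paper's particle-system reduction is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, dim, n_pairs = 50, 64, 20
E = rng.standard_normal((vocab, dim)) / np.sqrt(dim)      # input token embeddings
U = rng.standard_normal((vocab, dim)) / np.sqrt(dim)      # output token embeddings
pairs = [(i, (i * 7) % vocab) for i in range(n_pairs)]    # arbitrary (input, target) pairs

# Memory matrix: sum of outer products u_target e_input^T.
M = np.zeros((dim, dim))
for i, j in pairs:
    M += np.outer(U[j], E[i])

# Retrieval: project the query embedding through M, decode against output embeddings.
correct = 0
for i, j in pairs:
    scores = U @ (M @ E[i])                               # logits over the vocabulary
    correct += int(scores.argmax() == j)
print(f"recalled {correct}/{n_pairs} associations")
```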
Generalizing to out-of-distribution (OOD) data, that is, data from domains unseen during training, is a key challenge in modern machine learning, which has only recently received …