SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

PAC-Bayes compression bounds so tight that they can explain generalization

S Lotfi, M Finzi, S Kapoor… - Advances in …, 2022 - proceedings.neurips.cc
While there has been progress in developing non-vacuous generalization bounds for deep
neural networks, these bounds tend to be uninformative about why deep learning works. In …

When do flat minima optimizers work?

J Kaddour, L Liu, R Silva… - Advances in Neural …, 2022 - proceedings.neurips.cc
Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods,
have been shown to improve a neural network's generalization performance over stochastic …

(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability

M Even, S Pesme, S Gunasekar… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2 …

Can neural nets learn the same model twice? Investigating reproducibility and double descent from the decision boundary perspective

G Somepalli, L Fowl, A Bansal… - Proceedings of the …, 2022 - openaccess.thecvf.com
We discuss methods for visualizing neural network decision boundaries and decision
regions. We use these visualizations to investigate issues related to reproducibility and …

Subspace adversarial training

T Li, Y Wu, S Chen, K Fang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Single-step adversarial training (AT) has received wide attention as it has proved to be both
efficient and robust. However, a serious problem of catastrophic overfitting exists, i.e., the …

Stochastic collapse: How gradient noise attracts SGD dynamics towards simpler subnetworks

F Chen, D Kunin, A Yamamura… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives
overly expressive networks to much simpler subnetworks, thereby dramatically reducing the …

Why neural networks find simple solutions: The many regularizers of geometric complexity

B Dherin, M Munn, M Rosca… - Advances in Neural …, 2022 - proceedings.neurips.cc
In many contexts, simpler models are preferable to more complex models, and the control of
this model complexity is the goal of many methods in machine learning such as …

Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be

F Kunstner, J Chen, JW Lavington… - arXiv preprint arXiv …, 2023 - arxiv.org
The success of the Adam optimizer on a wide array of architectures has made it the default
in settings where stochastic gradient descent (SGD) performs poorly. However, our …
