Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization

K Wen, Z Li, T Ma - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Despite extensive studies, the underlying reason as to why overparameterized neural
networks can generalize remains elusive. Existing theory shows that common stochastic …

Double trouble in double descent: Bias and variance (s) in the lazy regime

S d'Ascoli, M Refinetti, G Biroli… - … on Machine Learning, 2020 - proceedings.mlr.press
Deep neural networks can achieve remarkable generalization performances while
interpolating the training data. Rather than the U-curve emblematic of the bias-variance …

On the proof of global convergence of gradient descent for deep ReLU networks with linear widths

Q Nguyen - International Conference on Machine Learning, 2021 - proceedings.mlr.press
We give a simple proof for the global convergence of gradient descent in training deep
ReLU networks with the standard square loss, and show some of its improvements over the …

What causes the test error? Going beyond bias-variance via ANOVA

L Lin, E Dobriban - Journal of Machine Learning Research, 2021 - jmlr.org
Modern machine learning methods are often overparametrized, allowing adaptation to the
data at a fine level. This can seem puzzling; in the worst case, such models do not need to …

Kernel and rich regimes in overparametrized models

B Woodworth, S Gunasekar, JD Lee… - … on Learning Theory, 2020 - proceedings.mlr.press
A recent line of work studies overparametrized neural networks in the “kernel regime,” i.e.,
when during training the network behaves as a kernelized linear predictor, and thus, training …
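
(Illustration, not from the cited paper.) In the kernel or “lazy” regime the network is well approximated by its first-order Taylor expansion in the parameters around initialization, so its outputs move like those of a linear predictor over fixed features. A minimal NumPy sketch of that linearization for a two-layer ReLU network, with all names, widths, and scalings chosen here purely for illustration:

# Hypothetical sketch (not the authors' code): compare a two-layer ReLU
# network with its first-order Taylor expansion around initialization.
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 1000                       # input dimension, hidden width
W0 = rng.normal(size=(m, d))         # first-layer weights at initialization
a0 = rng.normal(size=m)              # second-layer weights at initialization

def f(x, W, a):
    """Two-layer ReLU network with 1/sqrt(m) output scaling."""
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def f_linearized(x, W, a):
    """Taylor expansion of f in the parameters around (W0, a0)."""
    h0 = np.maximum(W0 @ x, 0.0)                                 # activations at init
    grad_a = h0 / np.sqrt(m)                                     # df/da at init
    grad_W = ((a0 * (W0 @ x > 0))[:, None] * x) / np.sqrt(m)     # df/dW at init
    return f(x, W0, a0) + grad_a @ (a - a0) + np.sum(grad_W * (W - W0))

x = rng.normal(size=d)
W = W0 + 0.01 * rng.normal(size=W0.shape)   # small parameter movement
a = a0 + 0.01 * rng.normal(size=m)
print(f(x, W, a), f_linearized(x, W, a))

For large width m and small parameter movement the two printed values nearly coincide, which is the sense in which the network “behaves as a kernelized linear predictor.”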

Rethinking bias-variance trade-off for generalization of neural networks

Z Yang, Y Yu, C You, J Steinhardt… - … on Machine Learning, 2020 - proceedings.mlr.press
The classical bias-variance trade-off predicts that bias decreases and variance increases with
model complexity, leading to a U-shaped risk curve. Recent work calls this into question for …
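
(Illustration, not from the cited paper.) The U-shaped risk curve follows from the standard decomposition of expected squared error into bias^2 + variance + irreducible noise. A small NumPy sketch that estimates the bias^2 and variance terms for polynomial regressors of increasing degree by refitting on resampled training sets (the setup and constants are hypothetical):

# Hypothetical sketch: estimate bias^2 and variance of polynomial regressors
# of increasing degree by refitting on freshly sampled noisy training sets.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-1, 1, 200)
f_true = lambda x: np.sin(np.pi * x)      # ground-truth regression function
noise_sd = 0.3

def fit_predict(degree):
    """Fit one polynomial on a fresh noisy training set; predict on x_test."""
    x_tr = rng.uniform(-1, 1, 30)
    y_tr = f_true(x_tr) + rng.normal(0, noise_sd, x_tr.shape)
    coefs = np.polyfit(x_tr, y_tr, degree)
    return np.polyval(coefs, x_test)

for degree in [1, 3, 9]:
    preds = np.stack([fit_predict(degree) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - f_true(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={var:.3f}")

In this classical picture bias^2 falls and variance rises with the degree; the cited paper examines how that picture changes for neural networks.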

Training of deep neural networks based on distance measures using RMSProp

T Kurbiel, S Khaleghian - arXiv preprint arXiv:1708.01911, 2017 - arxiv.org
The vanishing gradient problem was a major obstacle for the success of deep learning. In
recent years it was gradually alleviated through multiple different techniques. However, the …
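
(Illustration, not from the cited paper.) The snippet does not detail the paper's distance-measure variant, so the sketch below shows only the standard RMSProp update it builds on, which scales each step by a running root-mean-square of past gradients (names and constants are illustrative):

# Minimal sketch of the standard RMSProp update rule.
import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update: a running average of squared gradients scales the step."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Usage on a toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(100):
    w, cache = rmsprop_step(w, grad=w, cache=cache, lr=0.05)
print(w)  # approaches the minimizer at the origin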

Convex geometry and duality of over-parameterized neural networks

T Ergen, M Pilanci - Journal of Machine Learning Research, 2021 - jmlr.org
We develop a convex analytic approach to analyze finite width two-layer ReLU networks.
We first prove that an optimal solution to the regularized training problem can be …

Universal readout for graph convolutional neural networks

N Navarin, D Van Tran… - 2019 International Joint …, 2019 - ieeexplore.ieee.org
Several machine learning problems can be naturally defined over graph data. Recently,
many researchers have been focusing on the definition of neural networks for graphs. The …

Grokking: Generalization beyond overfitting on small algorithmic datasets

A Power, Y Burda, H Edwards, I Babuschkin… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper we propose to study generalization of neural networks on small algorithmically
generated datasets. In this setting, questions about data efficiency, memorization …