Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Z Ji, M Telgarsky - arXiv preprint arXiv:1909.12292, 2019 - arxiv.org
Recent theoretical work has guaranteed that overparameterized networks trained by
gradient descent achieve arbitrarily low training error, and sometimes even low test error …
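
As a concrete illustration of the setting in this abstract, the sketch below trains an over-parameterized one-hidden-layer ReLU network by full-batch gradient descent on a small synthetic classification task; the width, learning rate, data, and loss are placeholder choices, not the quantities analyzed in the paper.

```python
# Minimal sketch: over-parameterized shallow ReLU network trained by gradient
# descent on synthetic data. All hyperparameters are illustrative placeholders.
import torch

torch.manual_seed(0)
n, d, width = 100, 10, 4096          # width >> n: over-parameterized regime
X = torch.randn(n, d)
y = torch.sign(X[:, 0:1])            # simple separable labels in {-1, +1}

model = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(500):
    opt.zero_grad()
    margins = y * model(X)                          # y_i * f(x_i)
    loss = torch.log1p(torch.exp(-margins)).mean()  # logistic loss
    loss.backward()
    opt.step()

train_err = (torch.sign(model(X)) != y).float().mean()
print(f"logistic loss {loss.item():.4f}, training error {train_err.item():.2%}")
```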

How much over-parameterization is sufficient to learn deep ReLU networks?

Z Chen, Y Cao, D Zou, Q Gu - arXiv preprint arXiv:1911.12360, 2019 - arxiv.org
A recent line of research on deep learning focuses on the extremely over-parameterized
setting, and shows that when the network width is larger than a high-degree polynomial of …

The interpolation phase transition in neural networks: Memorization and generalization under lazy training

A Montanari, Y Zhong - The Annals of Statistics, 2022 - projecteuclid.org
The Annals of Statistics, 2022, Vol. 50, No. 5, 2816–2847. https://doi.org/10.1214/22-AOS2211

On the optimization and generalization of multi-head attention

P Deora, R Ghaderi, H Taheri… - arXiv preprint arXiv …, 2023 - arxiv.org
The training and generalization dynamics of the Transformer's core mechanism, namely the
attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on …
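
Since the abstract refers to the attention mechanism without restating it, the following is a minimal from-scratch sketch of multi-head self-attention; the embedding dimension, head count, and Gaussian weights are arbitrary illustrative choices.

```python
# Minimal sketch of multi-head self-attention (the Transformer component the
# abstract refers to); dimensions and head count are illustrative only.
import torch

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (batch, seq, d_model); each weight matrix: (d_model, d_model)."""
    b, t, d = X.shape
    dk = d // num_heads
    # Project and split into heads: (batch, heads, seq, dk).
    def split(W):
        return (X @ W).view(b, t, num_heads, dk).transpose(1, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(-2, -1) / dk ** 0.5       # scaled dot-product
    attn = torch.softmax(scores, dim=-1)               # attention weights
    out = (attn @ V).transpose(1, 2).reshape(b, t, d)  # merge heads
    return out @ Wo                                    # output projection

torch.manual_seed(0)
d_model, heads = 64, 4
X = torch.randn(2, 10, d_model)
Ws = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
print(multi_head_self_attention(X, *Ws, num_heads=heads).shape)  # (2, 10, 64)
```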

Bounding the width of neural networks via coupled initialization: a worst case analysis

A Munteanu, S Omlor, Z Song… - … on Machine Learning, 2022 - proceedings.mlr.press
A common method in training neural networks is to initialize all the weights to be
independent Gaussian vectors. We observe that by instead initializing the weights into …
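
The snippet is cut off before the alternative scheme is stated; one common form of coupled initialization, assumed here purely for illustration, duplicates each Gaussian hidden weight and gives the two copies output weights of opposite sign, so that the network output is exactly zero at initialization.

```python
# Hedged sketch of a "coupled" initialization for a one-hidden-layer ReLU net:
# each Gaussian hidden weight vector appears twice, with output weights of
# opposite sign, so f(x) = 0 exactly at initialization. (The abstract is
# truncated; this is one standard variant, not necessarily the paper's scheme.)
import torch

def coupled_init(d, width):
    assert width % 2 == 0, "width must be even to form pairs"
    half = width // 2
    W_half = torch.randn(half, d)                  # independent Gaussian vectors
    W = torch.cat([W_half, W_half], dim=0)         # each row appears twice
    a = torch.cat([torch.ones(half), -torch.ones(half)])  # opposite output signs
    return W, a

def forward(X, W, a):
    return torch.relu(X @ W.t()) @ a               # f(x) = sum_j a_j relu(w_j.x)

torch.manual_seed(0)
W, a = coupled_init(d=5, width=8)
X = torch.randn(3, 5)
print(forward(X, W, a))                            # exactly zero at initialization
```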

On the proof of global convergence of gradient descent for deep ReLU networks with linear widths

Q Nguyen - International Conference on Machine Learning, 2021 - proceedings.mlr.press
We give a simple proof for the global convergence of gradient descent in training deep
ReLU networks with the standard square loss, and show some of its improvements over the …
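
To make the training setup in this abstract concrete, the sketch below runs gradient descent with the standard square loss on a deep ReLU network whose hidden widths are on the order of the sample size; depth, width, and step size are placeholder choices rather than the paper's bounds.

```python
# Illustrative sketch: deep ReLU network trained by gradient descent on the
# square loss. Depth, width, and learning rate are placeholder choices.
import torch

torch.manual_seed(0)
n, d = 64, 8
X = torch.randn(n, d)
y = torch.sin(X.sum(dim=1, keepdim=True))        # arbitrary regression targets

width = 2 * n                                    # hidden width on the order of n
dims = [d, width, width, width, 1]               # three hidden ReLU layers
layers = []
for i in range(len(dims) - 1):
    layers.append(torch.nn.Linear(dims[i], dims[i + 1]))
    if i < len(dims) - 2:
        layers.append(torch.nn.ReLU())
model = torch.nn.Sequential(*layers)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = 0.5 * ((model(X) - y) ** 2).mean()    # standard square loss
    loss.backward()
    opt.step()
print(f"final square loss: {loss.item():.6f}")
```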

Robust learning for data poisoning attacks

Y Wang, P Mianjy, R Arora - International Conference on …, 2021 - proceedings.mlr.press
We investigate the robustness of stochastic approximation approaches against data
poisoning attacks. We focus on two-layer neural networks with ReLU activation and show …
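
As a toy version of the threat model in the abstract, the sketch below flips the labels of a fraction of the training set before running single-sample SGD on a two-layer ReLU network; the label-flipping attack and all hyperparameters are illustrative assumptions, not the paper's setting.

```python
# Toy sketch of data poisoning: an adversary flips a fraction of the training
# labels, then SGD is run on a two-layer ReLU network. The attack model and
# hyperparameters are illustrative assumptions only.
import torch

torch.manual_seed(0)
n, d, width, poison_frac = 200, 10, 256, 0.1
X = torch.randn(n, d)
y = torch.sign(X[:, 0:1])                        # clean labels in {-1, +1}

y_poisoned = y.clone()
idx = torch.randperm(n)[: int(poison_frac * n)]  # adversary picks 10% of points
y_poisoned[idx] *= -1                            # ... and flips their labels

model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for epoch in range(20):
    for i in torch.randperm(n).tolist():         # one-sample SGD passes
        opt.zero_grad()
        margin = y_poisoned[i] * model(X[i : i + 1])
        loss = torch.log1p(torch.exp(-margin)).mean()
        loss.backward()
        opt.step()

clean_err = (torch.sign(model(X)) != y).float().mean()
print(f"error on clean labels after training on poisoned data: {clean_err:.2%}")
```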

Global convergence of deep networks with one wide layer followed by pyramidal topology

QN Nguyen, M Mondelli - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Recent works have shown that gradient descent can find a global minimum for over-
parameterized neural networks where the widths of all the hidden layers scale polynomially …
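
To make the architecture in the abstract concrete, the sketch below builds a ReLU network with one wide first hidden layer followed by a pyramidal (non-increasing width) tail; the specific widths are arbitrary illustrative choices.

```python
# Sketch of the architecture the abstract describes: one wide first hidden
# layer followed by a pyramidal (non-increasing width) tail. Widths here are
# arbitrary illustrative choices.
import torch

def pyramidal_relu_net(d_in, wide_width, tail_widths, d_out=1):
    """First hidden layer has `wide_width` units; `tail_widths` must be
    non-increasing to respect the pyramidal topology."""
    assert all(a >= b for a, b in zip(tail_widths, tail_widths[1:]))
    dims = [d_in, wide_width] + list(tail_widths) + [d_out]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(torch.nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(torch.nn.ReLU())
    return torch.nn.Sequential(*layers)

net = pyramidal_relu_net(d_in=20, wide_width=1024, tail_widths=[128, 64, 32])
print(net(torch.randn(4, 20)).shape)             # torch.Size([4, 1])
```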

Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian

S Oymak, Z Fabian, M Li, M Soltanolkotabi - arXiv preprint arXiv …, 2019 - arxiv.org
Modern neural network architectures often generalize well despite containing many more
parameters than the size of the training dataset. This paper explores the generalization …
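
To show the object the abstract studies, the sketch below stacks the per-example gradients of a small ReLU network's output with respect to its parameters into an n x p Jacobian and inspects its singular-value profile; the network and data sizes are toy choices.

```python
# Sketch: form the n x p Jacobian of the network outputs with respect to the
# parameters and inspect its singular values (the object whose low-rank
# structure the abstract refers to). Sizes are toy choices for illustration.
import torch

torch.manual_seed(0)
n, d, width = 40, 5, 64
X = torch.randn(n, d)
model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
params = list(model.parameters())

rows = []
for i in range(n):
    grads = torch.autograd.grad(model(X[i : i + 1]).squeeze(), params)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))  # d f(x_i) / d theta
J = torch.stack(rows)                            # Jacobian, shape (n, num_params)

s = torch.linalg.svdvals(J)
print("top singular values:", s[:5])
print("fraction of energy in top 5:", ((s[:5] ** 2).sum() / (s ** 2).sum()).item())
```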

Six lectures on linearized neural networks

T Misiakiewicz, A Montanari - arXiv preprint arXiv:2308.13431, 2023 - arxiv.org
In these six lectures, we examine what can be learnt about the behavior of multi-layer neural
networks from the analysis of linear models. We first recall the correspondence between …
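
Since the lectures revolve around the correspondence between neural networks and linear models, the sketch below compares a small ReLU network with its first-order Taylor expansion around initialization, which is a model that is linear in the parameters; the sizes and the perturbation scale are illustrative.

```python
# Sketch of the linearization the lectures discuss: the first-order Taylor
# expansion of the network around its initialization,
#   f_lin(x; theta) = f(x; theta0) + <grad_theta f(x; theta0), theta - theta0>,
# which is linear in the parameters. Sizes are illustrative.
import copy
import torch

torch.manual_seed(0)
d, width = 5, 128
model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
model0 = copy.deepcopy(model)                    # frozen copy at initialization
theta0 = [p.detach().clone() for p in model0.parameters()]

def linearized(x, theta):
    """f(x; theta0) + <grad_theta f(x; theta0), theta - theta0>."""
    out0 = model0(x)
    grads = torch.autograd.grad(out0.sum(), list(model0.parameters()))
    corr = sum(((t - t0) * g).sum() for t, t0, g in zip(theta, theta0, grads))
    return out0.detach() + corr

# Perturb the real network slightly and compare it with its linearization;
# staying close to the linear model is the "lazy" regime the lectures analyze.
with torch.no_grad():
    for p in model.parameters():
        p.add_(1e-3 * torch.randn_like(p))
theta = [p.detach() for p in model.parameters()]

x = torch.randn(1, d)
print("network:   ", model(x).item())
print("linearized:", linearized(x, theta).item())
```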