R Sun, D Li, S Liang, T Ding… - IEEE Signal Processing …, 2020 - ieeexplore.ieee.org
One of the major concerns for neural network training is that the nonconvexity of the associated loss functions may cause a bad landscape. The recent success of neural …
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
C Fang, H He, Q Long, WJ Su - Proceedings of the National …, 2021 - National Acad Sciences
In this paper, we introduce the Layer-Peeled Model, a nonconvex, yet analytically tractable, optimization program, in a quest to better understand deep neural networks that are trained …
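For context, the kind of program this entry describes peels off the last layer and treats the penultimate-layer features as free optimization variables. A sketch of such a formulation (the constraint constants E_W, E_H and the notation below are assumptions for illustration, not taken from the excerpt):

    \min_{W,\,H}\ \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} \mathcal{L}\big(W h_{k,i},\, y_k\big)
    \quad \text{s.t.} \quad
    \frac{1}{K}\sum_{k=1}^{K}\|w_k\|_2^2 \le E_W, \qquad
    \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n}\|h_{k,i}\|_2^2 \le E_H,

where W = [w_1, ..., w_K]^T is the last-layer classifier and h_{k,i} is the free feature vector of the i-th training example in class k.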
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well …
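To make the kernel view concrete, the empirical NTK pairs two inputs by the inner product of parameter gradients, K(x, x') = <grad_theta f(x; theta), grad_theta f(x'; theta)>. A minimal sketch with an assumed toy two-layer network and finite-difference gradients (not code from the paper):

    import numpy as np

    def f(theta, x, width=64, d=3):
        # Scalar-output two-layer tanh network f(x) = v^T tanh(W x),
        # with all parameters packed into one flat vector theta.
        W = theta[: width * d].reshape(width, d)
        v = theta[width * d:]
        return v @ np.tanh(W @ x)

    def grad_theta(theta, x, eps=1e-6):
        # Central-difference approximation of the gradient of f w.r.t. theta.
        g = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
        return g

    def empirical_ntk(theta, x1, x2):
        # K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
        return grad_theta(theta, x1) @ grad_theta(theta, x2)

    rng = np.random.default_rng(0)
    width, d = 64, 3
    theta = rng.normal(size=width * d + width) / np.sqrt(d)
    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    print(empirical_ntk(theta, x1, x2))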
R Sun - arXiv preprint arXiv:1912.08957, 2019 - arxiv.org
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss …
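As a minimal reference point for the algorithms such an overview covers, here is a plain stochastic gradient descent loop on a least-squares objective (the toy data, batch size, and step size are assumptions chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 5))              # toy inputs
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=256)

    w = np.zeros(5)                            # parameters to learn
    lr, batch = 0.1, 32
    for step in range(500):
        idx = rng.choice(len(X), size=batch, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch    # gradient of the mini-batch squared error
        w -= lr * grad

    print(np.linalg.norm(w - w_true))          # should be close to zero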
We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are …
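Mode connectivity is typically probed by evaluating the loss along a path between two independently trained solutions; the simplest probe is the straight line theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b. A toy sketch under assumed data and architecture (not the authors' setup):

    import numpy as np

    WIDTH, D = 16, 2

    def forward(theta, X):
        W = theta[: WIDTH * D].reshape(WIDTH, D)
        v = theta[WIDTH * D:]
        A = np.tanh(X @ W.T)              # hidden activations, shape (n, WIDTH)
        return A @ v, A, v

    def loss(theta, X, y):
        pred, _, _ = forward(theta, X)
        return np.mean((pred - y) ** 2)

    def train(theta, X, y, lr=0.05, steps=2000):
        # Plain gradient descent on mean squared error for a two-layer tanh net.
        n = len(X)
        for _ in range(steps):
            pred, A, v = forward(theta, X)
            dpred = 2 * (pred - y) / n
            dv = A.T @ dpred
            dZ = np.outer(dpred, v) * (1 - A ** 2)
            dW = dZ.T @ X
            theta = theta - lr * np.concatenate([dW.ravel(), dv])
        return theta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, D))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

    theta_a = train(0.5 * rng.normal(size=WIDTH * D + WIDTH), X, y)
    theta_b = train(0.5 * rng.normal(size=WIDTH * D + WIDTH), X, y)

    # Evaluate the loss along the straight line between the two trained solutions.
    for alpha in np.linspace(0.0, 1.0, 11):
        theta = (1 - alpha) * theta_a + alpha * theta_b
        print(f"alpha={alpha:.1f}  loss={loss(theta, X, y):.4f}")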
Z Li, T Wang, S Arora - arXiv preprint arXiv:2110.06914, 2021 - arxiv.org
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local …
Continual (sequential) training and multitask (simultaneous) training often attempt to solve the same overall objective: to find a solution that performs well on all considered tasks …
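The contrast the excerpt draws can be made concrete: multitask training takes gradient steps on the summed objective over all tasks, while continual training optimizes the tasks one after another. A schematic sketch with two assumed toy regression tasks:

    import numpy as np

    def make_task(seed, d=4, n=128):
        # A toy linear-regression task; each task has its own target vector.
        r = np.random.default_rng(seed)
        X = r.normal(size=(n, d))
        return X, X @ r.normal(size=d)

    tasks = [make_task(1), make_task(2)]

    def grad(w, X, y):
        return 2 * X.T @ (X @ w - y) / len(X)      # gradient of mean squared error

    def multitask(w, lr=0.05, steps=300):
        # Simultaneous training: every step uses the summed objective over all tasks.
        for _ in range(steps):
            w = w - lr * sum(grad(w, X, y) for X, y in tasks)
        return w

    def continual(w, lr=0.05, steps=300):
        # Sequential training: finish task 1 entirely, then train on task 2 only.
        for X, y in tasks:
            for _ in range(steps):
                w = w - lr * grad(w, X, y)
        return w

    w0 = np.zeros(4)
    for name, w in [("multitask", multitask(w0)), ("continual", continual(w0))]:
        print(name, [round(float(np.mean((X @ w - y) ** 2)), 3) for X, y in tasks])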
We study how permutation symmetries in overparameterized multi-layer neural networks generate 'symmetry-induced' critical points. Assuming a network with $L$ layers of minimal …
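The permutation symmetry in question is easy to verify numerically: relabeling the hidden units of a two-layer network, together with the matching rows of the first-layer weights and entries of the output weights, leaves the computed function unchanged. A minimal sketch (shapes and sizes assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    width, d = 8, 3
    W = rng.normal(size=(width, d))    # first-layer weights
    v = rng.normal(size=width)         # output-layer weights
    x = rng.normal(size=d)

    def f(W, v, x):
        # Two-layer tanh network with scalar output.
        return v @ np.tanh(W @ x)

    perm = rng.permutation(width)
    # Relabel hidden units: reorder rows of W and entries of v consistently.
    print(np.allclose(f(W, v, x), f(W[perm], v[perm], x)))    # prints True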