Deja vu: Contextual sparsity for efficient LLMs at inference time

Z Liu, J Wang, T Dao, T Zhou, B Yuan… - International …, 2023 - proceedings.mlr.press
Large language models (LLMs) with hundreds of billions of parameters have sparked a new
wave of exciting AI applications. However, they are computationally expensive at inference …

A theoretical analysis of deep Q-learning

J Fan, Z Wang, Y Xie, Z Yang - Learning for dynamics and …, 2020 - proceedings.mlr.press
Despite the great empirical success of deep reinforcement learning, its theoretical
foundation is less well understood. In this work, we make the first attempt to theoretically …

Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data

S Frei, NS Chatterji, P Bartlett - Conference on Learning …, 2022 - proceedings.mlr.press
Benign overfitting, the phenomenon where interpolating models generalize well in the
presence of noisy data, was first observed in neural network models trained with gradient …

How much over-parameterization is sufficient to learn deep ReLU networks?

Z Chen, Y Cao, D Zou, Q Gu - arXiv preprint arXiv:1911.12360, 2019 - arxiv.org
A recent line of research on deep learning focuses on the extremely over-parameterized
setting, and shows that when the network width is larger than a high-degree polynomial of …

Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? A Neural Tangent Kernel Perspective

K Huang, Y Wang, M Tao… - Advances in neural …, 2020 - proceedings.neurips.cc
Deep residual networks (ResNets) have demonstrated better generalization performance
than deep feedforward networks (FFNets). However, the theory behind such a phenomenon …

Implicit bias in leaky ReLU networks trained on high-dimensional data

S Frei, G Vardi, PL Bartlett, N Srebro, W Hu - arXiv preprint arXiv …, 2022 - arxiv.org
The implicit biases of gradient-based optimization algorithms are conjectured to be a major
factor in the success of modern deep learning. In this work, we investigate the implicit bias of …

Implicit regularization of deep residual networks towards neural ODEs

P Marion, YH Wu, ME Sander, G Biau - arXiv preprint arXiv:2309.01213, 2023 - arxiv.org
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth
analogs, neural ordinary differential equations (ODEs), are also widely used. Despite their …

Proxy convexity: A unified framework for the analysis of neural networks trained by gradient descent

S Frei, Q Gu - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
Although the optimization objectives for learning neural networks are highly non-convex,
gradient-based methods have been wildly successful at learning neural networks in …

Overparameterization of deep ResNet: zero loss and mean-field analysis

Z Ding, S Chen, Q Li, SJ Wright - Journal of Machine Learning Research, 2022 - jmlr.org
Finding parameters in a deep neural network (NN) that fit training data is a nonconvex
optimization problem, but a basic first-order optimization method (gradient descent) finds a …

On the generalization of learning algorithms that do not converge

N Chandramoorthy, A Loukas… - Advances in Neural …, 2022 - proceedings.neurips.cc
Generalization analyses of deep learning typically assume that the training converges to a
fixed point. But recent results indicate that, in practice, the weights of deep neural networks …