On the implicit bias in deep-learning algorithms

G Vardi - Communications of the ACM, 2023 - dl.acm.org
Deep learning has been highly successful in recent years and has led to dramatic improvements in multiple domains …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Understanding gradient descent on the edge of stability in deep learning

S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …

Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Attention mechanism is a central component of the transformer architecture which led to the
phenomenal success of large language models. However, the theoretical principles …

Surrogate gap minimization improves sharpness-aware training

J Zhuang, B Gong, L Yuan, Y Cui, H Adam… - arXiv preprint arXiv …, 2022 - arxiv.org
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by
minimizing a perturbed loss defined as the maximum loss within a neighborhood in …
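
As a rough illustration of the "perturbed loss" the snippet refers to: SAM first takes an ascent step of radius rho in the gradient direction and then descends using the gradient at that perturbed point. Below is a minimal NumPy sketch on a toy least-squares objective (the objective, step size lr, and radius rho are illustrative placeholders, not from the paper):

```python
# Minimal sketch of a SAM-style perturbed-loss update on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

w, lr, rho = np.zeros(5), 0.1, 0.05
for _ in range(200):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step toward the worst-case neighbor
    w = w - lr * grad(w + eps)                   # descend using the gradient at the perturbed point
print(round(loss(w), 4))
```

The inner maximization over the neighborhood is approximated here by a single normalized gradient-ascent step, which is the usual first-order approximation in SAM-style methods.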

Understanding the generalization benefit of normalization layers: Sharpness reduction

K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they clearly also help …
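
For background on the layers the abstract names (the textbook definition, not this paper's contribution), layer normalization acts on a feature vector $x \in \mathbb{R}^d$ as

$$\mathrm{LN}(x) \;=\; \gamma \odot \frac{x - \mu(x)\,\mathbf{1}}{\sqrt{\sigma^2(x) + \epsilon}} + \beta, \qquad \mu(x) = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2(x) = \frac{1}{d}\sum_{i=1}^{d} \bigl(x_i - \mu(x)\bigr)^2,$$

with learned per-coordinate scale $\gamma$ and shift $\beta$ and a small $\epsilon$ for numerical stability; batch normalization replaces the per-vector statistics with per-mini-batch statistics.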

SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
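
For reference, the attention layer referred to here is, in its standard single-head form (stated only as background), the map

$$\mathrm{Attn}(X) \;=\; \mathrm{softmax}\!\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d_k}}\right) X W_V, \qquad X \in \mathbb{R}^{T \times d},$$

with the softmax applied row-wise over the $T$ tokens; the title's analogy to support vector machines is developed for this layer and the token weighting its softmax produces.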

Self-stabilization: The implicit bias of gradient descent at the edge of stability

A Damian, E Nichani, JD Lee - arXiv preprint arXiv:2209.15594, 2022 - arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the
Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "…
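
The $2/\eta$ threshold quoted here comes from the classical one-dimensional picture: gradient descent with step size $\eta$ on a quadratic of curvature $\lambda$ contracts exactly when the curvature is below $2/\eta$,

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t), \quad L(\theta) = \tfrac{\lambda}{2}\,\theta^2 \;\;\Longrightarrow\;\; \theta_{t+1} = (1 - \eta\lambda)\,\theta_t, \qquad |1 - \eta\lambda| < 1 \iff 0 < \lambda < 2/\eta,$$

so applying this along the Hessian's top eigenvector gives the stability condition $S(\theta) < 2/\eta$; this is the textbook argument, not the paper's self-stabilization mechanism itself.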

Smoothing the landscape boosts the signal for sgd: Optimal sample complexity for learning single index models

A Damian, E Nichani, R Ge… - Advances in Neural …, 2024 - proceedings.neurips.cc
We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect
to the isotropic Gaussian distribution in $ d $ dimensions. Prior work has shown that the …
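
Written out, the setup the abstract describes is the standard single-index regression problem (restated with the snippet's notation, plus the usual unit-norm convention as an assumption):

$$x \sim \mathcal{N}(0, I_d), \qquad y = \sigma(w^\star \cdot x), \qquad w^\star \in \mathbb{R}^d,\ \|w^\star\| = 1,$$

and the goal is to recover the hidden direction $w^\star$ from i.i.d. samples $(x_i, y_i)$, with sample complexity measured as a function of the dimension $d$.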