AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by \citet{cohen2021gradient} using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
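To make the quantity behind EoS concrete, the following is a minimal numpy sketch that tracks the sharpness (top Hessian eigenvalue, estimated by power iteration on finite-difference Hessian-vector products) along a full-batch GD run; the tanh regression model, step size, and iteration counts are illustrative assumptions, not taken from the paper.

import numpy as np

# Toy setup: least-squares regression with a tanh nonlinearity,
# chosen only for illustration (not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
y = rng.standard_normal(64)

def loss(theta):
    return 0.5 * np.mean((np.tanh(X @ theta) - y) ** 2)

def grad(theta, eps=1e-5):
    # Central-difference gradient (kept dependency-free; autodiff would be usual).
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def sharpness(theta, iters=50, eps=1e-4):
    # Top Hessian eigenvalue via power iteration on finite-difference
    # Hessian-vector products: H v ~ (grad(theta + eps*v) - grad(theta)) / eps.
    v = rng.standard_normal(theta.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad(theta + eps * v) - grad(theta)) / eps
        v = hv / (np.linalg.norm(hv) + 1e-12)
    hv = (grad(theta + eps * v) - grad(theta)) / eps
    return float(v @ hv)

eta = 0.05  # fixed learning rate; 2/eta is the stability threshold to compare against
theta = 0.1 * rng.standard_normal(5)
for t in range(200):
    theta -= eta * grad(theta)
    if t % 50 == 0:
        print(t, loss(theta), sharpness(theta), 2 / eta)

The printed sharpness values can be compared against $2/\eta$ to see whether the run approaches the regime the snippet describes.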
The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles …
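For reference, a single attention head computes a softmax-weighted average of value vectors; the sketch below implements that standard scaled dot-product formula in numpy with illustrative dimensions (it is not code from the paper).

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 8            # illustrative sequence length and head sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
print(attention(Q, K, V).shape)  # (6, 8)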
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in …
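The perturbed loss can be written as $\max_{\|\epsilon\|\le\rho} L(w+\epsilon)$, and in practice SAM approximates the inner maximum to first order by perturbing along the normalized gradient. Below is a minimal numpy sketch of one such step on a toy quadratic loss; the loss, learning rate, and $\rho$ are assumptions for illustration, not the paper's setup.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A @ A.T / 10  # toy PSD quadratic standing in for a training loss

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.1, rho=0.05):
    # First-order SAM: perturb along the normalized gradient to approximate
    # the worst-case point in the rho-ball, then descend using the gradient
    # evaluated at the perturbed point.
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad(w + eps)

w = rng.standard_normal(10)
for _ in range(100):
    w = sam_step(w)
print(loss(w))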
K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help …
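As a reminder of what these layers compute, here is a minimal numpy sketch of Batch Normalization (normalizing each feature over the batch) and Layer Normalization (normalizing each example over its features); the shapes and epsilon are standard illustrative choices, not specific to this paper.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then rescale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each example over its feature dimension, then rescale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).standard_normal((8, 4))
print(batch_norm(x, np.ones(4), np.zeros(4)).std(axis=0))   # ~1 per feature
print(layer_norm(x, np.ones(4), np.zeros(4)).std(axis=-1))  # ~1 per example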
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used …
Since its inception in" Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a …
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "…
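The $2/\eta$ threshold can be seen on a quadratic model; the short derivation below is included only as a reminder of the classical stability argument. For $L(\theta) = \tfrac{1}{2}\theta^\top H \theta$ with $H \succeq 0$, gradient descent gives
$$\theta_{t+1} = \theta_t - \eta H \theta_t = (I - \eta H)\,\theta_t,$$
so along an eigenvector of $H$ with eigenvalue $\lambda$ each step scales the iterate by $(1 - \eta\lambda)$. The iterates stay bounded iff $|1 - \eta\lambda| \le 1$ for every eigenvalue, i.e. iff the sharpness $S(\theta) = \lambda_{\max}(H)$ satisfies $S(\theta) \le 2/\eta$.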
We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the …
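To fix the setting, here is a minimal numpy sketch of the data model $y = \sigma(w^\star \cdot x)$ with $x \sim \mathcal{N}(0, I_d)$, together with plain online SGD on the squared loss; the particular link function ($\tanh$), step size, and projection to the unit sphere are illustrative assumptions, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)
d = 50
w_star = np.zeros(d)
w_star[0] = 1.0      # planted unit-norm direction
sigma = np.tanh      # illustrative link function; the paper's sigma may differ

def sample(n):
    x = rng.standard_normal((n, d))   # isotropic Gaussian inputs
    y = sigma(x @ w_star)             # labels from the single index model
    return x, y

# Online SGD on the squared loss, projecting back to the unit sphere each step.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
lr = 0.05
for t in range(20000):
    x, y = sample(1)
    pred = sigma(x[0] @ w)
    # d/dw of 0.5*(pred - y)^2 with sigma = tanh: (pred - y) * (1 - pred^2) * x
    g = (pred - y[0]) * (1 - pred ** 2) * x[0]
    w -= lr * g
    w /= np.linalg.norm(w)
print("alignment <w, w*>:", w @ w_star)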