A practical survey on faster and lighter transformers

Q Fournier, GM Caron, D Aloise - ACM Computing Surveys, 2023 - dl.acm.org
Recurrent neural networks are effective models to process sequences. However, they are
unable to learn long-term dependencies because of their inherent sequential nature. As a …
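
For context (a standard formula, not quoted from this survey): the Transformer architecture that the surveyed variants accelerate is built on scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

whose $O(n^2)$ cost in the sequence length $n$ is the bottleneck that "faster and lighter" variants aim to reduce.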

Analyzing and improving the training dynamics of diffusion models

T Karras, M Aittala, J Lehtinen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models currently dominate the field of data-driven image synthesis with their
unparalleled scaling to large datasets. In this paper we identify and rectify several causes for …

Understanding gradient descent on the edge of stability in deep learning

S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
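
A minimal statement of the phenomenon named in the snippet (standard from Cohen et al., 2021, not quoted from this paper): during the EoS phase the sharpness, i.e., the top Hessian eigenvalue of the training loss, equilibrates just above the classical stability threshold,

$$\lambda_{\max}\!\left(\nabla^2 L(\theta_t)\right) \approx \frac{2}{\eta},$$

where $\eta$ is the learning rate; classical analysis predicts GD diverges once $\lambda_{\max} > 2/\eta$, which is what makes the observed stable training surprising.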

Understanding the generalization benefit of normalization layers: Sharpness reduction

K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they clearly also help …
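
For reference, one of the two layers named above, Layer Normalization, computes (standard definition, not taken from this paper)

$$\mathrm{LN}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i, \qquad \mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2,$$

with learnable scale $\gamma$ and shift $\beta$; the paper's thesis, per its title, is that such layers also reduce the sharpness of the loss landscape.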

SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

Deep learning approach towards accurate state of charge estimation for lithium-ion batteries using self-supervised transformer model

MA Hannan, DNT How, MSH Lipu, M Mansor, PJ Ker… - Scientific reports, 2021 - nature.com
Accurate state of charge (SOC) estimation of lithium-ion (Li-ion) batteries is crucial in
prolonging cell lifespan and ensuring its safe operation for electric vehicle applications. In …

What Happens after SGD Reaches Zero Loss? A Mathematical Framework

Z Li, T Wang, S Arora - arXiv preprint arXiv:2110.06914, 2021 - arxiv.org
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …

A modern look at the relationship between sharpness and generalization

M Andriushchenko, F Croce, M Müller, M Hein… - arXiv preprint arXiv …, 2023 - arxiv.org
Sharpness of minima is a promising quantity that can correlate with generalization in deep
networks and, when optimized during training, can improve generalization. However …
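
The snippet does not define sharpness; a common worst-case notion studied in this literature (a standard definition, not quoted from the paper) is

$$S_\rho(\theta) = \max_{\|\delta\|_2 \le \rho} L(\theta + \delta) - L(\theta),$$

the largest loss increase within a perturbation ball of radius $\rho$; near a minimum, the top Hessian eigenvalue $\lambda_{\max}(\nabla^2 L(\theta))$ serves as its second-order proxy.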

On the validity of modeling SGD with stochastic differential equations (SDEs)

Z Li, S Malladi, S Arora - Advances in Neural Information …, 2021 - proceedings.neurips.cc
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is
important for good generalization in real-life deep nets. Most attempted explanations …
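
The approximation in question (the standard form in this line of work, not quoted from the snippet) models SGD with learning rate $\eta$ by the Itô SDE

$$dX_t = -\nabla L(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t,$$

where $\Sigma(X)$ is the covariance of the minibatch gradient noise at $X$; the paper examines when trajectories of this SDE actually track those of discrete SGD.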

Adapting the linearised laplace model evidence for modern deep learning

J Antorán, D Janz, JU Allingham… - International …, 2022 - proceedings.mlr.press
The linearised Laplace method for estimating model uncertainty has received renewed
attention in the Bayesian deep learning community. The method provides reliable error bars …
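
A minimal sketch of the method named in the title (its standard form, not quoted from this paper): the network $f(x;\theta)$ is linearised around the MAP estimate $\theta^*$,

$$f_{\mathrm{lin}}(x;\theta) = f(x;\theta^*) + J_{\theta^*}(x)\,(\theta - \theta^*),$$

and combining this with the Laplace posterior $\theta \sim \mathcal{N}(\theta^*, H^{-1})$, where $H$ approximates the loss Hessian at $\theta^*$, yields the closed-form error bars $\mathrm{Var}\!\left[f_{\mathrm{lin}}(x;\theta)\right] = J_{\theta^*}(x)\,H^{-1} J_{\theta^*}(x)^\top$.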