AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by \citet{cohen2021gradient} using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
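To make the quantity behind EoS concrete, the following is a minimal numpy sketch that tracks the sharpness (top Hessian eigenvalue, estimated by power iteration on finite-difference Hessian-vector products) along a full-batch GD run; the tanh regression model, step size, and iteration counts are illustrative assumptions, not taken from the paper.

import numpy as np

# Toy setup: least-squares regression with a tanh nonlinearity,
# chosen only for illustration (not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
y = rng.standard_normal(64)

def loss(theta):
    return 0.5 * np.mean((np.tanh(X @ theta) - y) ** 2)

def grad(theta, eps=1e-5):
    # Central-difference gradient (kept dependency-free; autodiff would be usual).
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def sharpness(theta, iters=50, eps=1e-4):
    # Top Hessian eigenvalue via power iteration on finite-difference
    # Hessian-vector products: H v ~ (grad(theta + eps*v) - grad(theta)) / eps.
    v = rng.standard_normal(theta.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad(theta + eps * v) - grad(theta)) / eps
        v = hv / (np.linalg.norm(hv) + 1e-12)
    hv = (grad(theta + eps * v) - grad(theta)) / eps
    return float(v @ hv)

eta = 0.05  # fixed learning rate; 2/eta is the stability threshold to compare against
theta = 0.1 * rng.standard_normal(5)
for t in range(200):
    theta -= eta * grad(theta)
    if t % 50 == 0:
        print(t, loss(theta), sharpness(theta), 2 / eta)

The printed sharpness values can be compared against $2/\eta$ to see whether the run approaches the regime the snippet describes.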
The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles …
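For reference, a single attention head computes a softmax-weighted average of value vectors; the sketch below implements that standard scaled dot-product formula in numpy with illustrative dimensions (it is not code from the paper).

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 8            # illustrative sequence length and head sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
print(attention(Q, K, V).shape)  # (6, 8)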
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in …
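The perturbed loss can be written as $\max_{\|\epsilon\|\le\rho} L(w+\epsilon)$, and in practice SAM approximates the inner maximum to first order by perturbing along the normalized gradient. Below is a minimal numpy sketch of one such step on a toy quadratic loss; the loss, learning rate, and $\rho$ are assumptions for illustration, not the paper's setup.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A @ A.T / 10  # toy PSD quadratic standing in for a training loss

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.1, rho=0.05):
    # First-order SAM: perturb along the normalized gradient to approximate
    # the worst-case point in the rho-ball, then descend using the gradient
    # evaluated at the perturbed point.
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad(w + eps)

w = rng.standard_normal(10)
for _ in range(100):
    w = sam_step(w)
print(loss(w))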
K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help …
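As a reminder of what these layers compute, here is a minimal numpy sketch of Batch Normalization (normalizing each feature over the batch) and Layer Normalization (normalizing each example over its features); the shapes and epsilon are standard illustrative choices, not specific to this paper.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then rescale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each example over its feature dimension, then rescale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).standard_normal((8, 4))
print(batch_norm(x, np.ones(4), np.zeros(4)).std(axis=0))   # ~1 per feature
print(layer_norm(x, np.ones(4), np.zeros(4)).std(axis=-1))  # ~1 per example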
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used …
Since its inception in" Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a …
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "…
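The $2/\eta$ threshold can be seen on a quadratic model; the short derivation below is included only as a reminder of the classical stability argument. For $L(\theta) = \tfrac{1}{2}\theta^\top H \theta$ with $H \succeq 0$, gradient descent gives
$$\theta_{t+1} = \theta_t - \eta H \theta_t = (I - \eta H)\,\theta_t,$$
so along an eigenvector of $H$ with eigenvalue $\lambda$ each step scales the iterate by $(1 - \eta\lambda)$. The iterates stay bounded iff $|1 - \eta\lambda| \le 1$ for every eigenvalue, i.e. iff the sharpness $S(\theta) = \lambda_{\max}(H)$ satisfies $S(\theta) \le 2/\eta$.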
We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the …
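To fix the setting, here is a minimal numpy sketch of the data model $y = \sigma(w^\star \cdot x)$ with $x \sim \mathcal{N}(0, I_d)$, together with plain online SGD on the squared loss; the particular link function ($\tanh$), step size, and projection to the unit sphere are illustrative assumptions, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)
d = 50
w_star = np.zeros(d)
w_star[0] = 1.0      # planted unit-norm direction
sigma = np.tanh      # illustrative link function; the paper's sigma may differ

def sample(n):
    x = rng.standard_normal((n, d))   # isotropic Gaussian inputs
    y = sigma(x @ w_star)             # labels from the single index model
    return x, y

# Online SGD on the squared loss, projecting back to the unit sphere each step.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
lr = 0.05
for t in range(20000):
    x, y = sample(1)
    pred = sigma(x[0] @ w)
    # d/dw of 0.5*(pred - y)^2 with sigma = tanh: (pred - y) * (1 - pred^2) * x
    g = (pred - y[0]) * (1 - pred ** 2) * x[0]
    w -= lr * g
    w /= np.linalg.norm(w)
print("alignment <w, w*>:", w @ w_star)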