Vision transformers provably learn spatial structure

S Jelassi, M Sander, Y Li - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Vision Transformers (ViTs) have recently achieved comparable or superior
performance to Convolutional Neural Networks (CNNs) in computer vision. This empirical …

Implicit bias of gradient descent for two-layer ReLU and leaky ReLU networks on nearly-orthogonal data

Y Kou, Z Chen, Q Gu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
The implicit bias towards solutions with favorable properties is believed to be a key reason
why neural networks trained by gradient-based optimization can generalize well. While the …

Robust learning with progressive data expansion against spurious correlation

Y Deng, Y Yang, B Mirzasoleiman… - Advances in neural …, 2024 - proceedings.neurips.cc
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable spurious features rather than the core features …

Why does sharpness-aware minimization generalize better than SGD?

Z Chen, J Zhang, Y Kou, X Chen… - Advances in neural …, 2024 - proceedings.neurips.cc
The challenge of overfitting, in which the model memorizes the training data and fails to
generalize to test data, has become increasingly significant in the training of large neural …

The benefits of mixup for feature learning

D Zou, Y Cao, Y Li, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
Mixup, a simple data augmentation method that randomly mixes two data points via linear
interpolation, has been extensively applied in various deep learning applications to gain …
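
A minimal sketch of the mixup operation the abstract refers to, assuming NumPy arrays and one-hot labels; the function name mixup_batch, the Beta(alpha, alpha) draw of the mixing coefficient, and the parameter names are illustrative conventions, not taken from the cited paper.

```python
import numpy as np

def mixup_batch(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mix two (input, one-hot label) pairs via linear interpolation,
    with the mixing coefficient drawn from Beta(alpha, alpha)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x_mixed = lam * x1 + (1.0 - lam) * x2  # interpolate inputs
    y_mixed = lam * y1 + (1.0 - lam) * y2  # interpolate labels the same way
    return x_mixed, y_mixed

# Example: mix two toy inputs with one-hot labels
x_a, y_a = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x_m, y_m = mixup_batch(x_a, y_a, x_b, y_b, alpha=0.2)
```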

Momentum provably improves error feedback!

I Fatkhullin, A Tyurin, P Richtárik - Advances in Neural …, 2024 - proceedings.neurips.cc
Due to the high communication overhead when training machine learning models in a
distributed environment, modern algorithms invariably rely on lossy communication …

Solving regularized exp, cosh and sinh regression problems

Z Li, Z Song, T Zhou - arXiv preprint arXiv:2303.15725, 2023 - arxiv.org
In modern machine learning, attention computation is a fundamental task for training large
language models such as Transformer, GPT-4 and ChatGPT. In this work, we study …

The marginal value of momentum for small learning rate SGD

R Wang, S Malladi, T Wang, K Lyu, Z Li - arXiv preprint arXiv:2307.15196, 2023 - arxiv.org
Momentum is known to accelerate the convergence of gradient descent in strongly convex
settings without stochastic gradient noise. In stochastic optimization, such as training neural …
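
A minimal sketch of the heavy-ball momentum update that the abstract compares against plain (stochastic) gradient descent; the function name sgd_momentum_step and the default values of lr and beta are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One heavy-ball momentum step: v accumulates an exponentially
    weighted sum of past gradients, and w moves along that direction."""
    v = beta * v + grad   # momentum buffer
    w = w - lr * v        # parameter update
    return w, v

# Example: a few steps on f(w) = 0.5 * ||w||^2, whose gradient is w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(5):
    w, v = sgd_momentum_step(w, v, grad=w)
```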

Understanding convergence and generalization in federated learning through feature learning theory

W Huang, Y Shi, Z Cai, T Suzuki - The Twelfth International …, 2023 - openreview.net
Federated Learning (FL) has attracted significant attention as an efficient privacy-preserving
approach to distributed learning across multiple clients. Despite extensive empirical …

Benign overfitting in two-layer ReLU convolutional neural networks for XOR data

X Meng, D Zou, Y Cao - arXiv preprint arXiv:2310.01975, 2023 - arxiv.org
Modern deep learning models are usually highly over-parameterized so that they can overfit
the training data. Surprisingly, such overfitting neural networks can usually still achieve high …