Vision transformers provably learn spatial structure

S Jelassi, M Sander, Y Li - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Vision Transformers (ViTs) have recently achieved comparable or superior
performance to Convolutional Neural Networks (CNNs) in computer vision. This empirical …

Implicit bias of gradient descent for two-layer ReLU and leaky ReLU networks on nearly-orthogonal data

Y Kou, Z Chen, Q Gu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
The implicit bias towards solutions with favorable properties is believed to be a key reason
why neural networks trained by gradient-based optimization can generalize well. While the …

Robust learning with progressive data expansion against spurious correlation

Y Deng, Y Yang, B Mirzasoleiman… - Advances in neural …, 2024 - proceedings.neurips.cc
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable spurious features rather than the core features …

Why does sharpness-aware minimization generalize better than SGD?

Z Chen, J Zhang, Y Kou, X Chen… - Advances in neural …, 2024 - proceedings.neurips.cc
The challenge of overfitting, in which the model memorizes the training data and fails to
generalize to test data, has become increasingly significant in the training of large neural …

The benefits of mixup for feature learning

D Zou, Y Cao, Y Li, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
Mixup, a simple data augmentation method that randomly mixes two data points via linear
interpolation, has been extensively applied in various deep learning applications to gain …
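
A minimal sketch of the mixup operation the abstract refers to, assuming NumPy arrays and one-hot labels; the function name mixup_batch, the Beta(alpha, alpha) draw of the mixing coefficient, and the parameter names are illustrative conventions, not taken from the cited paper.

```python
import numpy as np

def mixup_batch(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mix two (input, one-hot label) pairs via linear interpolation,
    with the mixing coefficient drawn from Beta(alpha, alpha)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x_mixed = lam * x1 + (1.0 - lam) * x2  # interpolate inputs
    y_mixed = lam * y1 + (1.0 - lam) * y2  # interpolate labels the same way
    return x_mixed, y_mixed

# Example: mix two toy inputs with one-hot labels
x_a, y_a = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x_m, y_m = mixup_batch(x_a, y_a, x_b, y_b, alpha=0.2)
```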

Momentum provably improves error feedback!

I Fatkhullin, A Tyurin, P Richtárik - Advances in Neural …, 2024 - proceedings.neurips.cc
Due to the high communication overhead when training machine learning models in a
distributed environment, modern algorithms invariably rely on lossy communication …

Solving regularized exp, cosh and sinh regression problems

Z Li, Z Song, T Zhou - arXiv preprint arXiv:2303.15725, 2023 - arxiv.org
In modern machine learning, attention computation is a fundamental task for training large
language models such as Transformer, GPT-4 and ChatGPT. In this work, we study …

The marginal value of momentum for small learning rate SGD

R Wang, S Malladi, T Wang, K Lyu, Z Li - arXiv preprint arXiv:2307.15196, 2023 - arxiv.org
Momentum is known to accelerate the convergence of gradient descent in strongly convex
settings without stochastic gradient noise. In stochastic optimization, such as training neural …
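
A minimal sketch of the heavy-ball momentum update that the abstract compares against plain (stochastic) gradient descent; the function name sgd_momentum_step and the default values of lr and beta are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One heavy-ball momentum step: v accumulates an exponentially
    weighted sum of past gradients, and w moves along that direction."""
    v = beta * v + grad   # momentum buffer
    w = w - lr * v        # parameter update
    return w, v

# Example: a few steps on f(w) = 0.5 * ||w||^2, whose gradient is w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(5):
    w, v = sgd_momentum_step(w, v, grad=w)
```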

Understanding convergence and generalization in federated learning through feature learning theory

W Huang, Y Shi, Z Cai, T Suzuki - The Twelfth International …, 2023 - openreview.net
Federated Learning (FL) has attracted significant attention as an efficient privacy-preserving
approach to distributed learning across multiple clients. Despite extensive empirical …

Benign overfitting in two-layer ReLU convolutional neural networks for XOR data

X Meng, D Zou, Y Cao - arXiv preprint arXiv:2310.01975, 2023 - arxiv.org
Modern deep learning models are usually highly over-parameterized so that they can overfit
the training data. Surprisingly, such overfitting neural networks can usually still achieve high …