Y Kou, Z Chen, Q Gu - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the …
While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable _spurious features_ rather than the core features …
The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural …
D Zou, Y Cao, Y Li, Q Gu - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Mixup, a simple data augmentation method that randomly mixes two data points via linear interpolation, has been extensively applied in various deep learning applications to gain …
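For concreteness, the linear-interpolation step that mixup performs can be sketched in a few lines. The snippet below illustrates the generic mixup recipe only; the function name, the Beta(alpha, alpha) mixing weight, and the toy data are assumptions for the example, not details taken from this paper.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Mix two (input, one-hot label) pairs by linear interpolation.

    The mixing weight lambda is drawn from Beta(alpha, alpha), as in the
    standard mixup recipe; alpha is a hyperparameter.
    """
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Toy usage: mix two inputs with one-hot labels.
x_a, y_a = np.array([1.0, 2.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([3.0, 0.0]), np.array([0.0, 1.0])
x_mixed, y_mixed = mixup(x_a, y_a, x_b, y_b)
```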
Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication …
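As one illustration of what lossy communication compression typically means in this setting, the sketch below applies top-k sparsification to a gradient before it would be sent over the network. This is a generic example of one widely used compressor, not the specific scheme studied in this entry, and the helper name is illustrative.

```python
import numpy as np

def top_k_compress(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector,
    zeroing the rest -- one common form of lossy gradient compression."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    compressed = np.zeros_like(grad)
    compressed[idx] = grad[idx]
    return compressed

g = np.array([0.1, -2.0, 0.05, 1.5, -0.3])
print(top_k_compress(g, k=2))  # only the two largest-magnitude entries survive
```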
Z Li, Z Song, T Zhou - arXiv preprint arXiv:2303.15725, 2023 - arxiv.org
In modern machine learning, attention computation is a fundamental task for training Transformer-based large language models such as GPT-4 and ChatGPT. In this work, we study …
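For reference, the attention computation mentioned here is, in its standard form, scaled dot-product attention, softmax(QK^T / sqrt(d)) V. The NumPy sketch below shows that standard operation as a generic illustration; it does not reproduce this paper's particular formulation, and the shapes are arbitrary examples.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8)
```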
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural …
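For reference, the classical heavy-ball form of momentum updates a velocity and then the weights, v_{t+1} = beta * v_t + grad(w_t), w_{t+1} = w_t - lr * v_{t+1}. The sketch below runs it on a strongly convex quadratic; the learning rate, momentum coefficient, and toy objective are illustrative assumptions, not the exact variant analyzed in this entry.

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr=0.1, beta=0.9):
    """One classical (heavy-ball) momentum step:
        v_{t+1} = beta * v_t + grad(w_t)
        w_{t+1} = w_t - lr * v_{t+1}
    """
    v_new = beta * v + grad(w)
    w_new = w - lr * v_new
    return w_new, v_new

# Toy strongly convex quadratic f(w) = 0.5 * w^T A w, minimized at the origin.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, v = heavy_ball_step(w, v, grad)
print(w)  # close to the minimizer at the origin
```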
W Huang, Y Shi, Z Cai, T Suzuki - The Twelfth International Conference on Learning Representations, 2023 - openreview.net
Federated Learning (FL) has attracted significant attention as an efficient privacy-preserving approach to distributed learning across multiple clients. Despite extensive empirical …
X Meng, D Zou, Y Cao - arXiv preprint arXiv:2310.01975, 2023 - arxiv.org
Modern deep learning models are usually highly over-parameterized so that they can overfit the training data. Surprisingly, such overfitted neural networks can usually still achieve high …