Towards understanding sharpness-aware minimization

M Andriushchenko… - … Conference on Machine …, 2022 - proceedings.mlr.press
Abstract Sharpness-Aware Minimization (SAM) is a recent training method that relies on
worst-case weight perturbations which significantly improves generalization in various …

Surrogate gap minimization improves sharpness-aware training

J Zhuang, B Gong, L Yuan, Y Cui, H Adam… - arXiv preprint arXiv …, 2022 - arxiv.org
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by
minimizing a perturbed loss defined as the maximum loss within a neighborhood in …
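
As a concrete reading of the objective described in these two abstracts, here is a minimal sketch of one SAM step on a toy quadratic loss: ascend along the normalized gradient to an approximate worst-case point within a rho-ball, then descend using the gradient taken there. The toy loss, rho, and learning rate are illustrative choices, not taken from either paper.

```python
import numpy as np

# Toy loss: L(w) = 0.5 * w^T A w, a stand-in for a training loss.
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: first-order approximation of the worst-case
    perturbation in a rho-ball, then a descent step with that gradient."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # approximate inner maximizer
    g_perturbed = grad(w + eps)                   # gradient at perturbed weights
    return w - lr * g_perturbed

w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w)
print(loss(w))
```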

On large-cohort training for federated learning

Z Charles, Z Garrett, Z Huo… - Advances in neural …, 2021 - proceedings.neurips.cc
Federated learning methods typically learn a model by iteratively sampling updates from a
population of clients. In this work, we explore how the number of clients sampled at each …
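
For context on what "sampling updates from a population of clients" looks like in code, below is a toy federated-averaging round: a cohort of clients is sampled, each runs a few local gradient steps on its own data, and the server averages the resulting models. The synthetic data, cohort size, and step counts are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "clients": each holds its own least-squares problem (X_k, y_k).
num_clients, dim = 100, 5
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(20, dim))
    y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=20)
    clients.append((X, y))

def local_update(w, X, y, lr=0.01, steps=5):
    """A few local gradient steps on one client's loss."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def fed_avg_round(w, cohort_size, lr=0.01):
    """One round: sample a cohort, run local training, average the models."""
    cohort = rng.choice(num_clients, size=cohort_size, replace=False)
    local_models = [local_update(w.copy(), *clients[k], lr=lr) for k in cohort]
    return np.mean(local_models, axis=0)

w = np.zeros(dim)
for _ in range(100):
    w = fed_avg_round(w, cohort_size=10)   # cohort size is the knob the paper studies
```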

Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data

T Lin, SP Karimireddy, SU Stich, M Jaggi - arXiv preprint arXiv:2102.04761, 2021 - arxiv.org
Decentralized training of deep learning models is a key element for enabling data privacy
and on-device learning over networks. In realistic learning scenarios, the presence of …

Information-theoretic generalization bounds for stochastic gradient descent

G Neu, GK Dziugaite, M Haghifam… - … on Learning Theory, 2021 - proceedings.mlr.press
We study the generalization properties of the popular stochastic optimization method known
as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our …

Deep learning for molecules and materials

AD White - Living journal of computational molecular science, 2022 - ncbi.nlm.nih.gov
Deep learning is becoming a standard tool in chemistry and materials science. Although
there are learning materials available for deep learning, none cover the applications in …

AdaER: An adaptive experience replay approach for continual lifelong learning

X Li, B Tang, H Li - Neurocomputing, 2024 - Elsevier
Continual lifelong learning is a machine learning framework inspired by human learning,
where learners are trained to continuously acquire new knowledge in a sequential manner …
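
As background for replay-based continual learning, here is a minimal rehearsal loop with a reservoir-sampled buffer; the adaptive replay policy that gives AdaER its name is not reproduced here, and the buffer size, batch size, and task stream are placeholder choices.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over the task stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example      # replace a randomly chosen stored example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buffer = ReplayBuffer(capacity=200)
for task_id in range(5):                     # a sequence of tasks
    for step in range(100):
        new_example = (task_id, step)        # stand-in for an (input, label) pair
        replay = buffer.sample(16)           # rehearse old examples alongside new ones
        # train_step(new_example, replay)    # the training step itself is omitted here
        buffer.add(new_example)
```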

Why (and When) does Local SGD Generalize Better than SGD?

X Gu, K Lyu, L Huang, S Arora - arXiv preprint arXiv:2303.01215, 2023 - arxiv.org
Local SGD is a communication-efficient variant of SGD for large-scale training, where
multiple GPUs perform SGD independently and average the model parameters periodically …
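
The algorithm described in this abstract can be sketched in a few lines: each worker runs SGD independently on its own data shard for a fixed number of local steps, and the parameters are averaged at every communication round. The synthetic least-squares objective, shard split, and hyperparameters below are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared toy objective: least squares on synthetic data, split across "workers".
dim, num_workers = 5, 4
X = rng.normal(size=(400, dim))
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=400)
shards = np.array_split(np.arange(400), num_workers)

def local_sgd(w, idx, lr=0.05, steps=10, batch=8):
    """Independent local SGD on one worker's shard."""
    for _ in range(steps):
        b = rng.choice(idx, size=batch, replace=False)
        w = w - lr * X[b].T @ (X[b] @ w - y[b]) / batch
    return w

w = np.zeros(dim)
for _ in range(50):                                   # communication rounds
    local_models = [local_sgd(w.copy(), s) for s in shards]
    w = np.mean(local_models, axis=0)                 # periodic parameter averaging
```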

Implicit gradient alignment in distributed and federated learning

Y Dandi, L Barba, M Jaggi - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
A major obstacle to achieving global convergence in distributed and federated learning is
the misalignment of gradients across clients or mini-batches due to heterogeneity and …

Achieving small-batch accuracy with large-batch scalability via Hessian-aware learning rate adjustment

S Lee, C He, S Avestimehr - Neural Networks, 2023 - Elsevier
We consider synchronous data-parallel neural network training with a fixed large batch size.
While the large batch size provides a high degree of parallelism, it degrades the …