Dichotomy of early and late phase implicit biases can provably induce grokking

K Lyu, J Jin, Z Li, SS Du, JD Lee… - The Twelfth International …, 2023 - openreview.net
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …

On the unreasonable effectiveness of federated averaging with heterogeneous data

J Wang, R Das, G Joshi, S Kale, Z Xu… - arXiv preprint arXiv …, 2022 - arxiv.org
Existing theory predicts that data heterogeneity will degrade the performance of the
Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the …

Fast mixing of stochastic gradient descent with normalization and weight decay

Z Li, T Wang, D Yu - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e.,
stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various …
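
As background on the setting (generic notation, not quoted from the paper): a loss is scale-invariant when rescaling the weights leaves it unchanged, which is what normalization layers induce, and the conjecture concerns how quickly SGD with weight decay equilibrates under this property. A minimal statement:

```latex
% Scale invariance induced by normalization layers: for every c > 0,
\[
  \mathcal{L}(c\,w) = \mathcal{L}(w).
\]
% SGD with learning rate $\eta$ and weight decay $\lambda$ updates
\[
  w_{t+1} = (1 - \eta\lambda)\, w_t - \eta\, \nabla\mathcal{L}(w_t).
\]
% For a scale-invariant loss the gradient is orthogonal to $w_t$, so the
% gradient step grows $\|w_t\|$ while weight decay shrinks it; their balance
% sets the effective step size $\eta/\|w_t\|^2$ of the induced dynamics on
% the unit sphere $w_t/\|w_t\|$.
```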

Decentralized SGD and average-direction SAM are asymptotically equivalent

T Zhu, F He, K Chen, M Song… - … Conference on Machine …, 2023 - proceedings.mlr.press
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on
massive devices simultaneously without the control of a central server. However, existing …
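
To make the algorithm named here concrete, below is a minimal numpy sketch of one D-SGD round under standard assumptions; the function name, the ring mixing matrix, and all parameters are illustrative, not taken from the paper. Each node takes a local gradient step on its own data and then averages parameters with its graph neighbors, with no central server involved.

```python
import numpy as np

def dsgd_round(params, grads, W, lr):
    """One round of decentralized SGD (D-SGD), sketched.

    params: (n_nodes, dim) array, one parameter vector per node
    grads:  (n_nodes, dim) array, local stochastic gradients
    W:      (n_nodes, n_nodes) doubly stochastic mixing (gossip) matrix
    lr:     learning rate
    """
    # Each node first takes a local SGD step on its own data ...
    local = params - lr * grads
    # ... then averages with its neighbors according to the communication
    # graph encoded in W; no central server coordinates the update.
    return W @ local

# Toy usage: 4 nodes on a ring topology, 10-dimensional parameters.
n, d = 4, 10
W = 0.5 * np.eye(n) + 0.25 * (np.roll(np.eye(n), 1, axis=0)
                              + np.roll(np.eye(n), -1, axis=0))
params = np.zeros((n, d))
grads = np.random.randn(n, d)
params = dsgd_round(params, grads, W, lr=0.1)
```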

DiLoCo: Distributed low-communication training of language models

A Douillard, Q Feng, AA Rusu, R Chhaparia… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have become a critical component in many applications of
machine learning. However, standard approaches to training LLMs require a large number of …

The marginal value of momentum for small learning rate SGD

R Wang, S Malladi, T Wang, K Lyu, Z Li - arXiv preprint arXiv:2307.15196, 2023 - arxiv.org
Momentum is known to accelerate the convergence of gradient descent in strongly convex
settings without stochastic gradient noise. In stochastic optimization, such as training neural …
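
For reference, the updates being compared can be written in generic notation (symbols ours, not the paper's): plain SGD versus SGD with heavy-ball momentum.

```latex
% Plain SGD with learning rate $\eta$ and stochastic gradient $g_t$:
\[
  w_{t+1} = w_t - \eta\, g_t .
\]
% SGD with heavy-ball momentum $\beta \in [0, 1)$:
\[
  m_{t+1} = \beta\, m_t + g_t, \qquad w_{t+1} = w_t - \eta\, m_{t+1}.
\]
% Unrolling the recursion, momentum replaces $g_t$ by a geometrically weighted
% average $\sum_{k \ge 0} \beta^k g_{t-k}$ of past stochastic gradients; the
% question studied is how much this buys when $\eta$ is small and gradient
% noise dominates.
```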

Near-optimal fully first-order algorithms for finding stationary points in bilevel optimization

L Chen, Y Ma, J Zhang - arXiv preprint arXiv:2306.14853, 2023 - arxiv.org
Bilevel optimization has various applications such as hyper-parameter optimization and
meta-learning. Designing theoretically efficient algorithms for bilevel optimization is more …
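
For readers unfamiliar with the setup, a generic bilevel problem (notation ours, not taken from the paper) has the form

```latex
\[
  \min_{x}\; F(x) := f\bigl(x,\, y^*(x)\bigr)
  \quad \text{s.t.} \quad
  y^*(x) \in \arg\min_{y}\; g(x, y),
\]
% e.g., in hyper-parameter optimization $x$ collects the hyper-parameters,
% $y$ the model weights trained on the inner objective $g$, and $f$ is the
% validation loss evaluated at the trained weights $y^*(x)$.
```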

Federated learning you may communicate less often!

M Sefidgaran, R Chor, A Zaidi, Y Wan - arXiv preprint arXiv:2306.05862, 2023 - arxiv.org
We investigate the generalization error of statistical learning models in a Federated
Learning (FL) setting. Specifically, we study the evolution of the generalization error with the …

Asynchronous Local-SGD Training for Language Modeling

B Liu, R Chhaparia, A Douillard, S Kale… - arXiv preprint arXiv …, 2024 - arxiv.org
Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is
an approach to distributed optimization where each device performs more than one SGD …
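
A minimal sketch of the method described in this first sentence, under standard assumptions; every name below (local_sgd_round, client_grad_fns, and so on) is illustrative rather than taken from the paper. Each client runs several SGD steps on its own data between communication rounds, and the resulting models are then averaged.

```python
import numpy as np

def local_sgd_round(global_params, client_grad_fns, local_steps, lr):
    """One communication round of Local-SGD / federated averaging, sketched.

    global_params:   current global parameter vector
    client_grad_fns: one callable per client, returning a stochastic gradient
                     computed on that client's local data
    local_steps:     number of SGD steps each client runs between syncs
    lr:              learning rate
    """
    client_params = []
    for grad_fn in client_grad_fns:
        w = global_params.copy()
        # Each device performs more than one SGD step before communicating,
        # instead of synchronizing after every single step.
        for _ in range(local_steps):
            w = w - lr * grad_fn(w)
        client_params.append(w)
    # The server (or an all-reduce) averages the locally updated models.
    return np.mean(client_params, axis=0)

# Toy usage: two clients with heterogeneous quadratic objectives.
targets = [np.ones(5), -np.ones(5)]
grad_fns = [lambda w, t=t: w - t for t in targets]   # grad of 0.5*||w - t||^2
w = np.zeros(5)
for _ in range(20):
    w = local_sgd_round(w, grad_fns, local_steps=4, lr=0.1)
```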

Leveraging Function Space Aggregation for Federated Learning at Scale

N Dhawan, N Mitchell, Z Charles, Z Garrett… - arXiv preprint arXiv …, 2023 - arxiv.org
The federated learning paradigm has motivated the development of methods for
aggregating multiple client updates into a global server model, without sharing client data …