Dichotomy of early and late phase implicit biases can provably induce grokking

K Lyu, J Jin, Z Li, SS Du, JD Lee… - The Twelfth International …, 2023 - openreview.net
Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in
learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect …

On the unreasonable effectiveness of federated averaging with heterogeneous data

J Wang, R Das, G Joshi, S Kale, Z Xu… - arXiv preprint arXiv …, 2022 - arxiv.org
Existing theory predicts that data heterogeneity will degrade the performance of the
Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the …

Fast mixing of stochastic gradient descent with normalization and weight decay

Z Li, T Wang, D Yu - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We prove the Fast Equilibrium Conjecture proposed by Li et al. (2020), i.e.,
stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various …
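
As background on the setting (generic notation, not quoted from the paper): a loss is scale-invariant when rescaling the weights leaves it unchanged, which is what normalization layers induce, and the conjecture concerns how quickly SGD with weight decay equilibrates under this property. A minimal statement:

```latex
% Scale invariance induced by normalization layers: for every c > 0,
\[
  \mathcal{L}(c\,w) = \mathcal{L}(w).
\]
% SGD with learning rate $\eta$ and weight decay $\lambda$ updates
\[
  w_{t+1} = (1 - \eta\lambda)\, w_t - \eta\, \nabla\mathcal{L}(w_t).
\]
% For a scale-invariant loss the gradient is orthogonal to $w_t$, so the
% gradient step grows $\|w_t\|$ while weight decay shrinks it; their balance
% sets the effective step size $\eta/\|w_t\|^2$ of the induced dynamics on
% the unit sphere $w_t/\|w_t\|$.
```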

Decentralized SGD and average-direction SAM are asymptotically equivalent

T Zhu, F He, K Chen, M Song… - … Conference on Machine …, 2023 - proceedings.mlr.press
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on
massive devices simultaneously without the control of a central server. However, existing …
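
To make the algorithm named here concrete, below is a minimal numpy sketch of one D-SGD round under standard assumptions; the function name, the ring mixing matrix, and all parameters are illustrative, not taken from the paper. Each node takes a local gradient step on its own data and then averages parameters with its graph neighbors, with no central server involved.

```python
import numpy as np

def dsgd_round(params, grads, W, lr):
    """One round of decentralized SGD (D-SGD), sketched.

    params: (n_nodes, dim) array, one parameter vector per node
    grads:  (n_nodes, dim) array, local stochastic gradients
    W:      (n_nodes, n_nodes) doubly stochastic mixing (gossip) matrix
    lr:     learning rate
    """
    # Each node first takes a local SGD step on its own data ...
    local = params - lr * grads
    # ... then averages with its neighbors according to the communication
    # graph encoded in W; no central server coordinates the update.
    return W @ local

# Toy usage: 4 nodes on a ring topology, 10-dimensional parameters.
n, d = 4, 10
W = 0.5 * np.eye(n) + 0.25 * (np.roll(np.eye(n), 1, axis=0)
                              + np.roll(np.eye(n), -1, axis=0))
params = np.zeros((n, d))
grads = np.random.randn(n, d)
params = dsgd_round(params, grads, W, lr=0.1)
```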

DiLoCo: Distributed low-communication training of language models

A Douillard, Q Feng, AA Rusu, R Chhaparia… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have become a critical component in many applications of
machine learning. However, standard approaches to training LLMs require a large number of …

The marginal value of momentum for small learning rate SGD

R Wang, S Malladi, T Wang, K Lyu, Z Li - arXiv preprint arXiv:2307.15196, 2023 - arxiv.org
Momentum is known to accelerate the convergence of gradient descent in strongly convex
settings without stochastic gradient noise. In stochastic optimization, such as training neural …
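
For reference, the updates being compared can be written in generic notation (symbols ours, not the paper's): plain SGD versus SGD with heavy-ball momentum.

```latex
% Plain SGD with learning rate $\eta$ and stochastic gradient $g_t$:
\[
  w_{t+1} = w_t - \eta\, g_t .
\]
% SGD with heavy-ball momentum $\beta \in [0, 1)$:
\[
  m_{t+1} = \beta\, m_t + g_t, \qquad w_{t+1} = w_t - \eta\, m_{t+1}.
\]
% Unrolling the recursion, momentum replaces $g_t$ by a geometrically weighted
% average $\sum_{k \ge 0} \beta^k g_{t-k}$ of past stochastic gradients; the
% question studied is how much this buys when $\eta$ is small and gradient
% noise dominates.
```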

Near-optimal fully first-order algorithms for finding stationary points in bilevel optimization

L Chen, Y Ma, J Zhang - arXiv preprint arXiv:2306.14853, 2023 - arxiv.org
Bilevel optimization has various applications such as hyper-parameter optimization and
meta-learning. Designing theoretically efficient algorithms for bilevel optimization is more …
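
For readers unfamiliar with the setup, a generic bilevel problem (notation ours, not taken from the paper) has the form

```latex
\[
  \min_{x}\; F(x) := f\bigl(x,\, y^*(x)\bigr)
  \quad \text{s.t.} \quad
  y^*(x) \in \arg\min_{y}\; g(x, y),
\]
% e.g., in hyper-parameter optimization $x$ collects the hyper-parameters,
% $y$ the model weights trained on the inner objective $g$, and $f$ is the
% validation loss evaluated at the trained weights $y^*(x)$.
```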

Federated learning you may communicate less often!

M Sefidgaran, R Chor, A Zaidi, Y Wan - arXiv preprint arXiv:2306.05862, 2023 - arxiv.org
We investigate the generalization error of statistical learning models in a Federated
Learning (FL) setting. Specifically, we study the evolution of the generalization error with the …

Asynchronous Local-SGD Training for Language Modeling

B Liu, R Chhaparia, A Douillard, S Kale… - arXiv preprint arXiv …, 2024 - arxiv.org
Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is
an approach to distributed optimization where each device performs more than one SGD …
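
A minimal sketch of the method described in this first sentence, under standard assumptions; every name below (local_sgd_round, client_grad_fns, and so on) is illustrative rather than taken from the paper. Each client runs several SGD steps on its own data between communication rounds, and the resulting models are then averaged.

```python
import numpy as np

def local_sgd_round(global_params, client_grad_fns, local_steps, lr):
    """One communication round of Local-SGD / federated averaging, sketched.

    global_params:   current global parameter vector
    client_grad_fns: one callable per client, returning a stochastic gradient
                     computed on that client's local data
    local_steps:     number of SGD steps each client runs between syncs
    lr:              learning rate
    """
    client_params = []
    for grad_fn in client_grad_fns:
        w = global_params.copy()
        # Each device performs more than one SGD step before communicating,
        # instead of synchronizing after every single step.
        for _ in range(local_steps):
            w = w - lr * grad_fn(w)
        client_params.append(w)
    # The server (or an all-reduce) averages the locally updated models.
    return np.mean(client_params, axis=0)

# Toy usage: two clients with heterogeneous quadratic objectives.
targets = [np.ones(5), -np.ones(5)]
grad_fns = [lambda w, t=t: w - t for t in targets]   # grad of 0.5*||w - t||^2
w = np.zeros(5)
for _ in range(20):
    w = local_sgd_round(w, grad_fns, local_steps=4, lr=0.1)
```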

Leveraging Function Space Aggregation for Federated Learning at Scale

N Dhawan, N Mitchell, Z Charles, Z Garrett… - arXiv preprint arXiv …, 2023 - arxiv.org
The federated learning paradigm has motivated the development of methods for
aggregating multiple client updates into a global server model, without sharing client data …