Towards understanding sharpness-aware minimization

M Andriushchenko… - … Conference on Machine …, 2022 - proceedings.mlr.press
Abstract Sharpness-Aware Minimization (SAM) is a recent training method that relies on
worst-case weight perturbations which significantly improves generalization in various …

Surrogate gap minimization improves sharpness-aware training

J Zhuang, B Gong, L Yuan, Y Cui, H Adam… - arXiv preprint arXiv …, 2022 - arxiv.org
The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by
minimizing a perturbed loss defined as the maximum loss within a neighborhood in …
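
As a concrete reading of the objective described in these two abstracts, here is a minimal sketch of one SAM step on a toy quadratic loss: ascend along the normalized gradient to an approximate worst-case point within a rho-ball, then descend using the gradient taken there. The toy loss, rho, and learning rate are illustrative choices, not taken from either paper.

```python
import numpy as np

# Toy loss: L(w) = 0.5 * w^T A w, a stand-in for a training loss.
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: first-order approximation of the worst-case
    perturbation in a rho-ball, then a descent step with that gradient."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # approximate inner maximizer
    g_perturbed = grad(w + eps)                   # gradient at perturbed weights
    return w - lr * g_perturbed

w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w)
print(loss(w))
```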

On large-cohort training for federated learning

Z Charles, Z Garrett, Z Huo… - Advances in neural …, 2021 - proceedings.neurips.cc
Federated learning methods typically learn a model by iteratively sampling updates from a
population of clients. In this work, we explore how the number of clients sampled at each …
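
For context on what "sampling updates from a population of clients" looks like in code, below is a toy federated-averaging round: a cohort of clients is sampled, each runs a few local gradient steps on its own data, and the server averages the resulting models. The synthetic data, cohort size, and step counts are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "clients": each holds its own least-squares problem (X_k, y_k).
num_clients, dim = 100, 5
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(20, dim))
    y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=20)
    clients.append((X, y))

def local_update(w, X, y, lr=0.01, steps=5):
    """A few local gradient steps on one client's loss."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def fed_avg_round(w, cohort_size, lr=0.01):
    """One round: sample a cohort, run local training, average the models."""
    cohort = rng.choice(num_clients, size=cohort_size, replace=False)
    local_models = [local_update(w.copy(), *clients[k], lr=lr) for k in cohort]
    return np.mean(local_models, axis=0)

w = np.zeros(dim)
for _ in range(100):
    w = fed_avg_round(w, cohort_size=10)   # cohort size is the knob the paper studies
```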

Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data

T Lin, SP Karimireddy, SU Stich, M Jaggi - arXiv preprint arXiv:2102.04761, 2021 - arxiv.org
Decentralized training of deep learning models is a key element for enabling data privacy
and on-device learning over networks. In realistic learning scenarios, the presence of …

Information-theoretic generalization bounds for stochastic gradient descent

G Neu, GK Dziugaite, M Haghifam… - … on Learning Theory, 2021 - proceedings.mlr.press
We study the generalization properties of the popular stochastic optimization method known
as stochastic gradient descent (SGD) for optimizing general non-convex loss functions. Our …

Deep learning for molecules and materials

AD White - Living journal of computational molecular science, 2022 - ncbi.nlm.nih.gov
Deep learning is becoming a standard tool in chemistry and materials science. Although
there are learning materials available for deep learning, none cover the applications in …

AdaER: An adaptive experience replay approach for continual lifelong learning

X Li, B Tang, H Li - Neurocomputing, 2024 - Elsevier
Continual lifelong learning is a machine learning framework inspired by human learning,
where learners are trained to continuously acquire new knowledge in a sequential manner …
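
As background for replay-based continual learning, here is a minimal rehearsal loop with a reservoir-sampled buffer; the adaptive replay policy that gives AdaER its name is not reproduced here, and the buffer size, batch size, and task stream are placeholder choices.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over the task stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example      # replace a randomly chosen stored example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buffer = ReplayBuffer(capacity=200)
for task_id in range(5):                     # a sequence of tasks
    for step in range(100):
        new_example = (task_id, step)        # stand-in for an (input, label) pair
        replay = buffer.sample(16)           # rehearse old examples alongside new ones
        # train_step(new_example, replay)    # the training step itself is omitted here
        buffer.add(new_example)
```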

Why (and When) does Local SGD Generalize Better than SGD?

X Gu, K Lyu, L Huang, S Arora - arXiv preprint arXiv:2303.01215, 2023 - arxiv.org
Local SGD is a communication-efficient variant of SGD for large-scale training, where
multiple GPUs perform SGD independently and average the model parameters periodically …
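
The algorithm described in this abstract can be sketched in a few lines: each worker runs SGD independently on its own data shard for a fixed number of local steps, and the parameters are averaged at every communication round. The synthetic least-squares objective, shard split, and hyperparameters below are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared toy objective: least squares on synthetic data, split across "workers".
dim, num_workers = 5, 4
X = rng.normal(size=(400, dim))
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=400)
shards = np.array_split(np.arange(400), num_workers)

def local_sgd(w, idx, lr=0.05, steps=10, batch=8):
    """Independent local SGD on one worker's shard."""
    for _ in range(steps):
        b = rng.choice(idx, size=batch, replace=False)
        w = w - lr * X[b].T @ (X[b] @ w - y[b]) / batch
    return w

w = np.zeros(dim)
for _ in range(50):                                   # communication rounds
    local_models = [local_sgd(w.copy(), s) for s in shards]
    w = np.mean(local_models, axis=0)                 # periodic parameter averaging
```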

Implicit gradient alignment in distributed and federated learning

Y Dandi, L Barba, M Jaggi - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
A major obstacle to achieving global convergence in distributed and federated learning is
the misalignment of gradients across clients or mini-batches due to heterogeneity and …

Achieving small-batch accuracy with large-batch scalability via Hessian-aware learning rate adjustment

S Lee, C He, S Avestimehr - Neural Networks, 2023 - Elsevier
We consider synchronous data-parallel neural network training with a fixed large batch size.
While the large batch size provides a high degree of parallelism, it degrades the …