Small-scale proxies for large-scale transformer training instabilities

M Wortsman, PJ Liu, L Xiao, K Everett, A Alemi… - arXiv preprint arXiv …, 2023 - arxiv.org
Teams that have trained large Transformer-based models have reported training instabilities
at large scale that did not appear when training with the same hyperparameters at smaller …

Prodigy: An expeditiously adaptive parameter-free learner

K Mishchenko, A Defazio - arXiv preprint arXiv:2306.06101, 2023 - arxiv.org
We consider the problem of estimating the learning rate in adaptive methods, such as
AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance …
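
To make the abstract concrete, here is a minimal Python sketch of the distance-estimation idea that Prodigy and its predecessor D-Adaptation build on: for convex problems, running sums of gradients and of the inner products <g_k, x0 - x_k> give a provable lower bound on the distance ||x0 - x*||, which is then used to scale the step size. This illustrates the principle only; it is not the exact Prodigy update (which weights these sums by the running estimate and is typically combined with an Adam-style base), and the function and parameter names are mine.

```python
import numpy as np

def distance_estimating_gd(grad, x0, steps=1000, d0=1e-6):
    """Illustrative D-Adaptation-style gradient descent (NOT the exact Prodigy rule).

    For convex f, num / ||s|| below is a provable lower bound on ||x0 - x*||,
    so the step-size scale d can only grow toward the true distance.
    """
    x = x0.copy()
    d = d0                        # current lower-bound estimate of ||x0 - x*||
    s = np.zeros_like(x0)         # weighted sum of gradients
    num = 0.0                     # weighted sum of <g_k, x0 - x_k>
    G2 = 0.0                      # sum of squared gradient norms (AdaGrad-like base)
    for _ in range(steps):
        g = grad(x)
        G2 += np.dot(g, g)
        if G2 == 0.0:
            break                 # zero gradient: stationary point
        lam = d / np.sqrt(G2)     # step size proportional to the distance estimate
        num += lam * np.dot(g, x0 - x)
        s += lam * g
        d = max(d, num / (np.linalg.norm(s) + 1e-12))
        x = x - lam * g
    return x
```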

DoWG unleashed: An efficient universal parameter-free gradient descent method

A Khaled, K Mishchenko, C Jin - Advances in Neural …, 2023 - proceedings.neurips.cc
This paper proposes a new easy-to-implement parameter-free gradient-based optimizer:
DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient---matching the …
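
A from-memory sketch of the "distance over weighted gradients" step the abstract refers to: the step size is the squared running maximum distance from the starting point divided by the square root of a distance-weighted sum of squared gradient norms. Initialization and edge cases are simplified here and may differ from the paper; names are illustrative.

```python
import numpy as np

def dowg(grad, x0, steps=1000, r_eps=1e-4):
    """Sketch of a DoWG-style ("distance over weighted gradients") update.

    r_bar tracks the largest distance travelled from x0 so far; v accumulates
    distance-weighted squared gradient norms; the step is r_bar**2 / sqrt(v).
    """
    x = x0.copy()
    r_bar = r_eps                 # small initial distance estimate (only input scale)
    v = 0.0                       # sum of r_bar_k**2 * ||g_k||**2
    for _ in range(steps):
        g = grad(x)
        r_bar = max(r_bar, np.linalg.norm(x - x0))
        v += r_bar**2 * np.dot(g, g)
        if v == 0.0:
            break                 # zero gradient at the start
        x = x - (r_bar**2 / np.sqrt(v)) * g
    return x
```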

Mechanic: A learning rate tuner

A Cutkosky, A Defazio, H Mehta - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce a technique for tuning the learning rate scale factor of any base optimization
algorithm and schedule automatically, which we call Mechanic. Our method provides a …
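
Structurally, a tuner of this kind wraps a base optimizer: it accumulates the base updates into a direction and multiplies that direction by a scalar scale factor tuned online. The sketch below shows only that wrapper interface; the scale update is a crude placeholder, not the actual Mechanic rule (which is derived from online-learning guarantees), and all names are mine.

```python
import numpy as np

class ScaleTuner:
    """Structural sketch of a learning-rate-scale wrapper (NOT the Mechanic rule).

    The base optimizer proposes updates; we accumulate them into delta and use
    the iterate x_ref + s * delta, tuning the scalar s online.
    """
    def __init__(self, x0, s_init=1e-8):
        self.x_ref = x0.copy()
        self.delta = np.zeros_like(x0)    # cumulative base-optimizer update
        self.s = s_init                   # scale factor being tuned

    def step(self, base_update, grad):
        self.delta += base_update
        # Placeholder heuristic: shrink s when the gradient correlates positively
        # with the accumulated direction, grow it otherwise. Mechanic replaces
        # this with a principled online-learning update.
        corr = np.dot(grad, self.delta)
        self.s *= np.exp(-0.01 * np.sign(corr))
        return self.x_ref + self.s * self.delta
```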

Adaptive proximal gradient method for convex optimization

Y Malitsky, K Mishchenko - arXiv preprint arXiv:2308.02261, 2023 - arxiv.org
In this paper, we explore two fundamental first-order algorithms in convex optimization,
namely, gradient descent (GD) and proximal gradient method (ProxGD). Our focus is on …
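
For reference, a sketch of the line-search-free adaptive step-size rule from the authors' earlier AdGD method, which this paper extends to the proximal setting: each step size is bounded by a local inverse-curvature estimate ||x_k - x_{k-1}|| / ||grad(x_k) - grad(x_{k-1})|| and by a controlled growth factor based on the previous ratio of step sizes. Constants and initialization are simplified and may differ from the paper; the proximal term is omitted.

```python
import numpy as np

def adaptive_gd(grad, x0, steps=1000, lam0=1e-6):
    """Sketch of an AdGD-style adaptive step size (smooth part only, no prox).

    No line search and no global Lipschitz constant: the step size respects a
    local curvature estimate and a cap on how fast it may grow.
    """
    x_prev, g_prev = x0.copy(), grad(x0)
    x = x_prev - lam0 * g_prev            # one small warm-up step
    lam_prev, theta = lam0, 1e12          # theta ~ ratio of consecutive step sizes
    for _ in range(steps):
        g = grad(x)
        dg = np.linalg.norm(g - g_prev)
        local = np.linalg.norm(x - x_prev) / (2.0 * dg) if dg > 0 else np.inf
        lam = min(np.sqrt(1.0 + theta) * lam_prev, local)
        x_prev, g_prev = x, g
        x = x - lam * g
        theta, lam_prev = lam / lam_prev, lam
    return x
```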

The price of adaptivity in stochastic convex optimization

Y Carmon, O Hinder - arXiv preprint arXiv:2402.10898, 2024 - arxiv.org
We prove impossibility results for adaptivity in non-smooth stochastic convex optimization.
Given a set of problem parameters we wish to adapt to, we define a "price of adaptivity" (PoA) …

A simple uniformly optimal method without line search for convex optimization

T Li, G Lan - arXiv preprint arXiv:2310.10082, 2023 - arxiv.org
Line search (or backtracking) procedures have been widely employed in first-order
methods for solving convex optimization problems, especially those with unknown problem …
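
For context on what "line search (or backtracking)" refers to here, a minimal sketch of a standard Armijo backtracking step, the kind of per-iteration search whose extra function evaluations line-search-free methods like this one aim to avoid. Names and constants are generic, not taken from the paper.

```python
import numpy as np

def armijo_step(f, grad, x, alpha0=1.0, rho=0.5, c=1e-4):
    """One gradient step with standard Armijo backtracking line search.

    The trial step is shrunk until a sufficient-decrease condition holds;
    each shrink costs an extra evaluation of f.
    """
    g = grad(x)
    fx = f(x)
    alpha = alpha0
    while f(x - alpha * g) > fx - c * alpha * np.dot(g, g):
        alpha *= rho
        if alpha < 1e-12:
            break                 # give up shrinking further
    return x - alpha * g
```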

On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms

P Latafat, A Themelis… - 6th Annual Learning for …, 2024 - proceedings.mlr.press
Building upon recent works on linesearch-free adaptive proximal gradient methods, this
paper proposes AdaPG, a framework that unifies and extends existing results by providing …

Revisiting the last-iterate convergence of stochastic gradient methods

Z Liu, Z Zhou - arXiv preprint arXiv:2312.08531, 2023 - arxiv.org
In the past several years, the convergence of the last iterate of the Stochastic Gradient
Descent (SGD) algorithm has attracted considerable interest due to its good performance in …
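
The distinction behind "last iterate" is simply which point the algorithm returns. A small illustration, assuming a generic 1/sqrt(t) step-size schedule (not specific to this paper): classical analyses bound the averaged iterate, while this line of work studies the final iterate that practitioners actually use.

```python
import numpy as np

def sgd_last_and_average(stoch_grad, x0, steps=10000, c=0.1):
    """Runs SGD and returns both candidate outputs: the last iterate and the
    running (Polyak-Ruppert style) average of the iterates."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(steps):
        g = stoch_grad(x)                     # unbiased stochastic (sub)gradient
        x = x - (c / np.sqrt(t + 1.0)) * g
        avg += (x - avg) / (t + 1.0)          # incremental mean of x_1, ..., x_{t+1}
    return x, avg
```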

SANIA: Polyak-type optimization framework leads to scale invariant stochastic algorithms

F Abdukhakimov, C Xiang, D Kamzolov… - arXiv preprint arXiv …, 2023 - arxiv.org
Adaptive optimization methods are widely recognized as among the most popular
approaches for training Deep Neural Networks (DNNs). Techniques such as Adam …
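
As background for "Polyak-type", a sketch of the classical Polyak step size that this family of methods builds on and generalizes (this is the textbook rule, not the SANIA update itself); f_star is an estimate of the optimal value, often taken as 0 for interpolating models.

```python
import numpy as np

def polyak_sgd(loss, grad, x0, steps=1000, f_star=0.0):
    """Gradient method with the classical Polyak step eta = (f(x) - f_star) / ||g||^2."""
    x = x0.copy()
    for _ in range(steps):
        g = grad(x)
        g2 = np.dot(g, g)
        if g2 == 0.0:
            break                              # stationary point
        eta = max(loss(x) - f_star, 0.0) / g2  # clamp in case f_star overestimates
        x = x - eta * g
    return x
```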