Adam has been empirically shown to outperform gradient descent in optimizing large language transformers, and by a larger margin than on other tasks, but it is unclear why this …
A Orvieto, L Xiao - arXiv preprint arXiv:2407.04358, 2024 - arxiv.org
We consider the problem of minimizing the average of a large number of smooth but possibly non-convex functions. In the context of most machine learning applications, each …
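For reference, the snippet above describes the standard finite-sum (empirical risk) formulation; the notation below is assumed for illustration rather than taken from the paper:

\[
  \min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\]

where each $f_i$ is smooth but possibly non-convex, e.g. the loss on the $i$-th training example.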
The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a high positive …
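As context, the "second-order properties" in question are those of the Hessian of the training loss; a standard definition (notation assumed here, not taken from the paper) is

\[
  H(\theta) = \nabla^2_{\theta} L(\theta),
\]

whose eigenvalues measure the curvature of the loss $L$ along the corresponding eigenvector directions.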
H Sivan, M Gabel, A Schuster - arXiv preprint arXiv:2302.08484, 2023 - arxiv.org
Though second-order optimization methods are highly effective, popular approaches in machine learning such as SGD and Adam use only first-order information due to the difficulty …
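To make the contrast concrete, here is a minimal sketch of the two first-order updates named in the snippet (SGD and Adam), which use only the gradient and no curvature information; the hyperparameter values are illustrative defaults, not values from the paper:

```python
import numpy as np

def sgd_step(theta, grad, lr=1e-2):
    # Plain first-order update: step against the gradient.
    return theta - lr * grad

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam also uses only first-order information (the gradient), but keeps
    # exponential moving averages of the gradient and its elementwise square.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Usage on a toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta:
theta = np.ones(3)
state = (np.zeros(3), np.zeros(3), 0)
for _ in range(100):
    grad = theta
    theta, state = adam_step(theta, grad, state)
```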
The impressive recent applications of machine learning have coincided with an increase in the costs of developing new methods. Beyond the obvious computational cost due to the …
Deep learning has fundamentally transformed the field of image synthesis, facilitated by the emergence of generative models that demonstrate a remarkable ability to generate …
Deep learning technologies are skyrocketing in popularity across a wide range of domains, with groundbreaking accomplishments in fields such as natural language processing …
We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in optimization dynamics. When training with gradient descent, the loss …
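As an illustration of "heavy-tailed class imbalance" (a sketch under the common assumption that token frequencies in language data are roughly Zipf-distributed, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume a Zipf-like (heavy-tailed) distribution over a vocabulary of classes,
# as is typical for token frequencies in language-modeling corpora.
vocab_size = 10_000
ranks = np.arange(1, vocab_size + 1)
probs = 1.0 / ranks
probs /= probs.sum()

# Sample a corpus and count how often each class appears.
tokens = rng.choice(vocab_size, size=200_000, p=probs)
counts = np.bincount(tokens, minlength=vocab_size)

# A few frequent classes account for a large share of tokens,
# while thousands of rare classes are seen only a handful of times.
top_10_share = counts[np.argsort(counts)[-10:]].sum() / counts.sum()
rare_classes = (counts < 10).sum()
print(f"top-10 classes cover {top_10_share:.1%} of tokens")
print(f"{rare_classes} of {vocab_size} classes appear fewer than 10 times")
```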