In this paper, we first explain why spikes commonly occur in the training loss when neural networks are trained with stochastic gradient descent (SGD) …
E Abdelaleem - arXiv preprint arXiv:2410.19867, 2024 - arxiv.org
The quest for simplification in physics drives the exploration of concise mathematical representations for complex systems. This Dissertation focuses on the concept of …
K Park, S Lee - arXiv preprint arXiv:2412.08894, 2024 - arxiv.org
We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of the widely used adaptive learning rate …
S Iwase, S Takahashi, N Inoue, R Yokota… - … Conference on Pattern …, 2025 - Springer
The double descent phenomenon, which deviates from the traditional bias-variance trade-off theory, attracts considerable research attention; however, the mechanism of its occurrence is …
The practical applications of neural networks are vast and varied, yet a comprehensive understanding of their underlying principles remains incomplete. This dissertation advances …
Z Zhang, P Lin, Z Wang, Y Zhang, ZQJ Xu - The Thirty-eighth Annual … - openreview.net
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we …