How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?

H Li, M Wang, S Lu, X Cui, PY Chen - High-dimensional Learning …, 2024 - openreview.net
Chain-of-Thought (CoT) is an efficient prompting method that elicits the reasoning ability
of large language models by augmenting the query using multiple examples with …
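The augmentation step is easy to illustrate. Below is a minimal, hedged Python sketch of prepending worked reasoning demonstrations to a query before it is sent to a model; the demonstrations and the build_cot_prompt helper are illustrative placeholders, not the construction studied in the paper.

```python
# Minimal, hypothetical sketch of CoT prompting: the new query is augmented with
# worked (question, reasoning) demonstrations whose answers spell out intermediate
# steps. The demonstrations and helper below are illustrative, not from the paper.

COT_DEMOS = [
    ("If there are 3 cars and each car has 4 wheels, how many wheels are there in total?",
     "Each car has 4 wheels, so 3 cars have 3 * 4 = 12 wheels. The answer is 12."),
    ("Tom had 5 apples and ate 2. How many apples are left?",
     "Tom started with 5 apples and ate 2, so 5 - 2 = 3 apples remain. The answer is 3."),
]

def build_cot_prompt(query: str) -> str:
    """Prepend the worked demonstrations to the new query."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in COT_DEMOS)
    return f"{demos}\n\nQ: {query}\nA:"

print(build_cot_prompt("A train has 6 coaches with 20 seats each. How many seats in total?"))
```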

Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features

S Bombari, M Mondelli - arXiv preprint arXiv:2402.02969, 2024 - arxiv.org
Unveiling the reasons behind the exceptional success of transformers requires a better
understanding of why attention layers are suitable for NLP tasks. In particular, such tasks …

Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

H Yang, B Kailkhura, Z Wang, Y Liang - arXiv preprint arXiv:2410.09605, 2024 - arxiv.org
Understanding the training dynamics of transformers is important to explain the impressive
capabilities behind large language models. In this work, we study the dynamics of training a …
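For reference, gradient flow is the continuous-time idealization of gradient descent that such analyses typically study; a standard, paper-agnostic statement is

\[
\frac{d\theta(t)}{dt} = -\nabla_{\theta} L\big(\theta(t)\big),
\]

i.e., the limit of the gradient descent iteration $\theta_{k+1} = \theta_k - \eta\,\nabla_{\theta} L(\theta_k)$ as the step size $\eta \to 0$.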

Architecture Design: From Neural Networks to Foundation Models

G Chrysos - 2024 IEEE 11th International Conference on Data …, 2024 - ieeexplore.ieee.org
Historically, we are taught to use task-dependent architecture design and objectives to
tackle data science tasks. Counterintuitively, this dogma has been proven (partly) wrong by …

LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

J Wang, S Zhang, QC He, Y Chen - arXiv preprint arXiv:2501.02573, 2025 - arxiv.org
The machine learning and data science community has made significant, though dispersed,
progress in accelerating transformer-based large language models (LLMs), and one …
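The primitive being accelerated, exponentially decaying causal linear attention, is easy to state: $o_t = \sum_{s \le t} \lambda^{t-s}\,(q_t^{\top} k_s)\,v_s$. Below is a naive, unoptimized PyTorch reference sketch under that assumed formulation; it is meant only as a baseline description of the operation, not as the CUDA kernels or API that LeetDecoding provides.

```python
import torch

def decaying_causal_linear_attention(q, k, v, lam: float = 0.99):
    """Naive O(T^2) reference: out[t] = sum_{s <= t} lam**(t-s) * (q[t] . k[s]) * v[s].

    Hedged sketch of the operation only, not LeetDecoding's implementation.
    Shapes: q, k are (T, d); v is (T, d_v).
    """
    T = q.shape[0]
    scores = q @ k.T                                      # (T, T) unnormalized linear-attention scores
    t_idx = torch.arange(T).unsqueeze(1)                  # query positions, column vector
    s_idx = torch.arange(T).unsqueeze(0)                  # key positions, row vector
    decay = lam ** (t_idx - s_idx).clamp(min=0).float()   # lam^(t-s) for s <= t
    causal = (s_idx <= t_idx).float()                     # zero out future (s > t) positions
    return (scores * decay * causal) @ v                  # (T, d_v)

q, k, v = (torch.randn(8, 16) for _ in range(3))
print(decaying_causal_linear_attention(q, k, v).shape)    # torch.Size([8, 16])
```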

Unraveling the Gradient Descent Dynamics of Transformers

B Song, B Han, S Zhang, J Ding, M Hong - arXiv preprint arXiv …, 2024 - arxiv.org
While the Transformer architecture has achieved remarkable success across various
domains, a thorough theoretical foundation explaining its optimization dynamics is yet to be …

Tecniche di Deep Learning per la previsione di brillamenti solari [Deep Learning Techniques for Solar Flare Prediction]

A Pellegri - 2024 - unire.unige.it
This thesis sets out to explore advanced Machine Learning and Deep Learning techniques
applied to the prediction of solar flares. These events, known as solar flares, are …

Gradient Descent and Attention Models: Challenges Posed by the Softmax Function

S Tarmoun, LE MacDonald, H Min, Z Xu, R Vidal - openreview.net
Transformers have become ubiquitous in modern machine learning applications, yet their
training remains a challenging task often requiring extensive trial and error. Unlike previous …
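One standard fact that makes the difficulty concrete (stated generically, not as this paper's specific result) is the Jacobian of the softmax: for $p = \mathrm{softmax}(z)$,

\[
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),
\]

so as the attention distribution saturates toward a one-hot vector, the Jacobian entries shrink toward zero and little gradient signal flows through the attention weights, which is one reason gradient-based training of attention models can require careful tuning.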

The Implicit Bias of Scale Factor in Attention Layer

S Zhu - one-punch24.github.io
The attention mechanism is essential to the success of today's large language models, and
many works are devoted to understanding it better. In this report, we focus on the …
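For context, the scale factor in question is the temperature in scaled dot-product attention; the standard definition (not the report's specific model) is

\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\]

where $d_k$ is the key dimension. Rescaling the $1/\sqrt{d_k}$ factor changes the softmax temperature and hence how concentrated the attention weights become; this scale is the quantity whose implicit bias the report examines.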