The pitfalls of next-token prediction

G Bachmann, V Nagarajan - arXiv preprint arXiv:2403.06963, 2024 - arxiv.org
Can a mere next-token predictor faithfully model human intelligence? We crystallize this
intuitive concern, which is fragmented in the literature. As a starting point, we argue that the …

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

ME Ildiz, Y Huang, Y Li, AS Rawat, S Oymak - arXiv preprint arXiv …, 2024 - arxiv.org
Modern language models rely on the transformer architecture and attention mechanism to
perform language understanding and text generation. In this work, we study learning a 1 …

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

C Zheng, W Huang, R Wang, G Wu, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Autoregressively trained transformers have brought a profound revolution to the world,
especially with their in-context learning (ICL) ability to address downstream tasks. Recently …

On the Power of Convolution Augmented Transformer

M Li, X Zhang, Y Huang, S Oymak - arXiv preprint arXiv:2407.05591, 2024 - arxiv.org
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Y Jiang, G Rajendran, P Ravikumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …

Self-attention Networks Localize When QK-eigenspectrum Concentrates

H Bao, R Hataya, R Karakida - arXiv preprint arXiv:2402.02098, 2024 - arxiv.org
The self-attention mechanism prevails in modern machine learning. It has an interesting
functionality of adaptively selecting tokens from an input sequence by modulating the …

Upper and lower memory capacity bounds of transformers for next-token prediction

L Madden, C Fox, C Thrampoulidis - arXiv preprint arXiv:2405.13718, 2024 - arxiv.org
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the
next-token conditional probability distribution. Decoder-only transformers have become …
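For reference, the next-token prediction task described here is the standard autoregressive modeling objective; a minimal formulation (notation assumed, not quoted from the paper): given a prefix $x_{1:t-1}$, a model $p_\theta$ outputs the conditional distribution of the next token, and training minimizes the cross-entropy over the sequence,

\[
  p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}),
  \qquad
  \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).
\]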

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

G Liu, H Mao, J Tang, KM Johnson - arXiv preprint arXiv:2407.15286, 2024 - arxiv.org
Large Language Models (LLMs) are capable of producing content that perpetuates
stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a …

Climbing the Complexity Ladder with Expressive Attention

C Gros - arXiv preprint arXiv:2407.18601, 2024 - arxiv.org
Attention involves comparing query and key vectors in terms of a scalar product, $\mathbf{Q}^{T}\mathbf{K}$, together with a subsequent softmax normalization. Classically …
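For context, the "classical" attention this abstract contrasts against obtains its weights by passing the query-key scalar products through a row-wise softmax; a standard formulation (notation assumed rather than taken from the paper):

\[
  a_{ij} = \frac{\exp\!\big(\mathbf{q}_i^{\top}\mathbf{k}_j / \sqrt{d}\big)}{\sum_{j'} \exp\!\big(\mathbf{q}_i^{\top}\mathbf{k}_{j'} / \sqrt{d}\big)},
  \qquad
  \mathbf{y}_i = \sum_{j} a_{ij}\,\mathbf{v}_j,
\]

where the $1/\sqrt{d}$ scaling is the usual convention and may differ from the paper's setup.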

How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?

H Li, M Wang, S Lu, X Cui, PY Chen - High-dimensional Learning …, 2024 - openreview.net
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with …
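As an illustration of the prompting pattern this abstract refers to, a minimal sketch of a few-shot chain-of-thought prompt follows; the example questions, reasoning text, and helper names are hypothetical and not drawn from the cited paper.

# Minimal sketch of a few-shot chain-of-thought prompt (illustrative only;
# the example content below is hypothetical, not taken from the cited paper).
examples = [
    ("If a train travels 60 km in 1.5 hours, what is its speed?",
     "Speed = distance / time = 60 / 1.5 = 40 km/h. The answer is 40 km/h."),
    ("A shop sells 3 pens for $2. How much do 12 pens cost?",
     "12 pens is 4 groups of 3 pens, so 4 * $2 = $8. The answer is $8."),
]
query = "A recipe needs 2 eggs per cake. How many eggs are needed for 7 cakes?"

# Each example pairs a question with worked intermediate reasoning steps,
# and the new query is appended at the end for the model to continue.
prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {query}\nA:"
print(prompt)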