The pitfalls of next-token prediction

G Bachmann, V Nagarajan - arXiv preprint arXiv:2403.06963, 2024 - arxiv.org
Can a mere next-token predictor faithfully model human intelligence? We crystallize this
intuitive concern, which is fragmented in the literature. As a starting point, we argue that the …

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

ME Ildiz, Y Huang, Y Li, AS Rawat, S Oymak - arXiv preprint arXiv …, 2024 - arxiv.org
Modern language models rely on the transformer architecture and attention mechanism to
perform language understanding and text generation. In this work, we study learning a 1 …

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

C Zheng, W Huang, R Wang, G Wu, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Autoregressively trained transformers have brought a profound revolution to the world,
especially with their in-context learning (ICL) ability to address downstream tasks. Recently …

On the Power of Convolution Augmented Transformer

M Li, X Zhang, Y Huang, S Oymak - arXiv preprint arXiv:2407.05591, 2024 - arxiv.org
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Y Jiang, G Rajendran, P Ravikumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …

Self-attention Networks Localize When QK-eigenspectrum Concentrates

H Bao, R Hataya, R Karakida - arXiv preprint arXiv:2402.02098, 2024 - arxiv.org
The self-attention mechanism prevails in modern machine learning. It has an interesting
functionality of adaptively selecting tokens from an input sequence by modulating the …

Upper and lower memory capacity bounds of transformers for next-token prediction

L Madden, C Fox, C Thrampoulidis - arXiv preprint arXiv:2405.13718, 2024 - arxiv.org
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the
next-token conditional probability distribution. Decoder-only transformers have become …
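For reference, the next-token prediction task described here is the standard autoregressive modeling objective; a minimal formulation (notation assumed, not quoted from the paper): given a prefix $x_{1:t-1}$, a model $p_\theta$ outputs the conditional distribution of the next token, and training minimizes the cross-entropy over the sequence,

\[
  p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}),
  \qquad
  \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).
\]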

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

G Liu, H Mao, J Tang, KM Johnson - arXiv preprint arXiv:2407.15286, 2024 - arxiv.org
Large Language Models (LLMs) are capable of producing content that perpetuates
stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a …

Climbing the Complexity Ladder with Expressive Attention

C Gros - arXiv preprint arXiv:2407.18601, 2024 - arxiv.org
Attention involves comparing query and key vectors in terms of a scalar product, $\mathbf{Q}^{T}\mathbf{K}$, together with a subsequent softmax normalization. Classically …
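For context, the "classical" attention this abstract contrasts against obtains its weights by passing the query-key scalar products through a row-wise softmax; a standard formulation (notation assumed rather than taken from the paper):

\[
  a_{ij} = \frac{\exp\!\big(\mathbf{q}_i^{\top}\mathbf{k}_j / \sqrt{d}\big)}{\sum_{j'} \exp\!\big(\mathbf{q}_i^{\top}\mathbf{k}_{j'} / \sqrt{d}\big)},
  \qquad
  \mathbf{y}_i = \sum_{j} a_{ij}\,\mathbf{v}_j,
\]

where the $1/\sqrt{d}$ scaling is the usual convention and may differ from the paper's setup.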

How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?

H Li, M Wang, S Lu, X Cui, PY Chen - High-dimensional Learning …, 2024 - openreview.net
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with …
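As an illustration of the prompting pattern this abstract refers to, a minimal sketch of a few-shot chain-of-thought prompt follows; the example questions, reasoning text, and helper names are hypothetical and not drawn from the cited paper.

# Minimal sketch of a few-shot chain-of-thought prompt (illustrative only;
# the example content below is hypothetical, not taken from the cited paper).
examples = [
    ("If a train travels 60 km in 1.5 hours, what is its speed?",
     "Speed = distance / time = 60 / 1.5 = 40 km/h. The answer is 40 km/h."),
    ("A shop sells 3 pens for $2. How much do 12 pens cost?",
     "12 pens is 4 groups of 3 pens, so 4 * $2 = $8. The answer is $8."),
]
query = "A recipe needs 2 eggs per cake. How many eggs are needed for 7 cakes?"

# Each example pairs a question with worked intermediate reasoning steps,
# and the new query is appended at the end for the model to continue.
prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {query}\nA:"
print(prompt)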