Copy suppression: Comprehensively understanding an attention head

C McDougall, A Conmy, C Rushing, T McGrath… - arXiv preprint arXiv …, 2023 - arxiv.org
We present a single attention head in GPT-2 Small that has one main role across the entire
training distribution. If components in earlier layers predict a certain token, and this token …
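The mechanism lends itself to a quick empirical check. Below is a minimal sketch, assuming the TransformerLens library: the head location (L10H7 in GPT-2 Small) and the "All's fair in love and" prompt are taken from the paper, while zero-ablation is an illustrative simplification (mean-ablation would be more faithful).

```python
# Minimal sketch of probing the copy-suppression head, assuming TransformerLens.
# L10H7 and the prompt come from the paper; zero-ablation is a simplification.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # L10H7, the copy-suppression head in GPT-2 Small

tokens = model.to_tokens("All's fair in love and")
love_id = model.to_single_token(" love")  # naive copying would repeat " love"

def zero_head(z, hook):
    z[:, :, HEAD, :] = 0.0  # knock out this head's output at every position
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)]
)
print("logit(' love') with head:   ", clean_logits[0, -1, love_id].item())
print("logit(' love') head ablated:", ablated_logits[0, -1, love_id].item())
# If the head suppresses the copied token, the ablated logit should be higher.
```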

Successor heads: Recurring, interpretable attention heads in the wild

R Gould, E Ong, G Ogden, A Conmy - arXiv preprint arXiv:2312.09230, 2023 - arxiv.org
In this work we present successor heads: attention heads that increment tokens with a
natural ordering, such as numbers, months, and days. For example, successor heads …
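A rough way to test a candidate successor head is to push token embeddings through its OV circuit and check whether each item's successor gets a high logit. A minimal sketch assuming TransformerLens; the head index is a placeholder, and using the raw embedding W_E simplifies the richer effective embedding the paper works with.

```python
# Sketch of a successorship test for one head, assuming TransformerLens.
# The head index is hypothetical; the paper locates such heads empirically.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 9, 1  # placeholder head to test

# Assumes each day name is a single GPT-2 token with a leading space.
days = [" Monday", " Tuesday", " Wednesday", " Thursday", " Friday"]
ids = torch.tensor([model.to_single_token(d) for d in days])

# Effective OV circuit for this head: W_E -> W_V -> W_O -> W_U
W_V = model.W_V[LAYER, HEAD]          # [d_model, d_head]
W_O = model.W_O[LAYER, HEAD]          # [d_head, d_model]
emb = model.W_E[ids]                  # [5, d_model]
logits = emb @ W_V @ W_O @ model.W_U  # [5, d_vocab]

# A successor head should rank each day's successor near the top.
for i, d in enumerate(days[:-1]):
    rank = (logits[i] > logits[i, ids[i + 1]]).sum().item()
    print(f"{d} -> {days[i + 1]}: successor logit rank {rank}")
```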

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org
Automated interpretability research has recently attracted attention as a potential research
direction that could scale explanations of neural network behavior to large models. Existing …
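Attribution patching replaces a full activation-patching sweep with a first-order estimate: delta_metric ~= (a_corrupt - a_clean) . d(metric)/d(a_clean). A minimal sketch assuming TransformerLens; the prompts, metric, and hook point are illustrative, and the two prompts must tokenize to the same length.

```python
# Minimal sketch of attribution patching at one hook point, assuming
# TransformerLens; prompts, metric, and hook location are illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.attn.hook_z"

clean = model.to_tokens("The capital of France is")
corrupt = model.to_tokens("The capital of Poland is")  # same token length

def metric(logits):
    return logits[0, -1, model.to_single_token(" Paris")]

_, corrupt_cache = model.run_with_cache(corrupt)

clean_acts = {}
def save_act(act, hook):
    act.retain_grad()              # keep the gradient on this non-leaf tensor
    clean_acts[hook.name] = act

logits = model.run_with_hooks(clean, fwd_hooks=[(hook_name, save_act)])
metric(logits).backward()

a = clean_acts[hook_name]
# First-order estimate of what activation patching would measure:
# delta_metric ~= (a_corrupt - a_clean) . grad_a metric
estimate = ((corrupt_cache[hook_name] - a.detach()) * a.grad).sum()
print("attribution-patching estimate:", estimate.item())
```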

Information flow routes: Automatically interpreting language models at scale

J Ferrando, E Voita - arXiv preprint arXiv:2403.00824, 2024 - arxiv.org
Information flows along routes inside the network via mechanisms implemented in the model.
These routes can be represented as graphs where nodes correspond to token …
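The graph view suggests a simple approximation one can compute directly: score each attention head's write into a token's residual stream by its relative norm and keep the strong edges. The sketch below, assuming TransformerLens, uses a norm ratio with an arbitrary threshold, which simplifies the paper's importance measure.

```python
# Sketch in the spirit of information-flow routes, assuming TransformerLens:
# keep head->token edges whose contribution norm passes a threshold.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)

pos = tokens.shape[1] - 1  # inspect routes into the final position
edges = []
for layer in range(model.cfg.n_layers):
    z = cache["z", layer][0, pos]              # [n_heads, d_head]
    W_O = model.W_O[layer]                     # [n_heads, d_head, d_model]
    head_out = torch.einsum("hd,hdm->hm", z, W_O)
    resid = cache["resid_post", layer][0, pos]
    score = head_out.norm(dim=-1) / resid.norm()
    for h in range(model.cfg.n_heads):
        if score[h] > 0.1:                     # arbitrary threshold
            edges.append((f"L{layer}H{h}", f"pos{pos}", score[h].item()))

for edge in sorted(edges, key=lambda t: -t[2])[:10]:
    print(edge)
```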

Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT

Z He, X Ge, Q Tang, T Sun, Q Cheng, X Qiu - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse dictionary learning has been a rapidly growing technique in mechanistic
interpretability to attack superposition and extract more human-understandable features …
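For reference, the core object here is a sparse autoencoder trained to reconstruct cached activations under an L1 sparsity penalty. A minimal PyTorch sketch; the dictionary size and L1 coefficient are illustrative, not the paper's settings.

```python
# Minimal sparse autoencoder of the kind used for dictionary learning in
# mechanistic interpretability; sizes and the L1 coefficient are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

acts = torch.randn(64, 512)           # stand-in for cached model activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward(); opt.step(); opt.zero_grad()
```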

Transformer circuit evaluation metrics are not robust

J Miller, B Chughtai, W Saunders - First Conference on Language …, 2024 - openreview.net
Mechanistic interpretability work attempts to reverse engineer the learned algorithms
present inside neural networks. One focus of this work has been to discover 'circuits' …
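The robustness question arises because different faithfulness metrics can rank the same ablated circuit differently. Below are two metrics commonly used in this literature, logit difference and KL divergence to the full model, in a minimal PyTorch sketch with stand-in logits; the numbers it prints are purely illustrative.

```python
# Two common circuit-faithfulness metrics computed on the same logits.
import torch
import torch.nn.functional as F

def logit_diff(logits, correct_id, wrong_id):
    # Margin between the correct and a contrast token, averaged over a batch.
    return (logits[..., correct_id] - logits[..., wrong_id]).mean()

def kl_to_full_model(circuit_logits, full_logits):
    # KL(full || circuit) over the output distribution, per example.
    return F.kl_div(
        F.log_softmax(circuit_logits, dim=-1),
        F.log_softmax(full_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )

full = torch.randn(8, 50257)                    # stand-in full-model logits
circuit = full + 0.5 * torch.randn_like(full)   # stand-in ablated-circuit logits
print(logit_diff(circuit, 0, 1).item(), kl_to_full_model(circuit, full).item())
```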

Automatically Identifying and Interpreting Sparse Circuits with Hierarchical Tracing

X Ge, W Shu, J Wang, F Zhu, Z He, X Qiu - openreview.net
We present a novel approach to Transformer circuit analysis using Sparse Autoencoders
(SAEs) and Transcoders. SAEs allow fine-grained feature extraction from model activations …
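A transcoder, the second ingredient named here, is a sparse module trained to imitate an MLP layer's input-to-output map rather than to reconstruct its input as an SAE does. A minimal PyTorch sketch; the architecture and sizes are a generic sketch, not the paper's exact setup.

```python
# Minimal transcoder: a sparse module imitating an MLP's input->output map,
# used alongside SAEs for circuit tracing. Sizes are illustrative.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, mlp_in):
        f = torch.relu(self.enc(mlp_in))  # sparse feature activations
        return self.dec(f), f

tc = Transcoder(d_model=768, d_dict=8192)
mlp_in = torch.randn(32, 768)             # stand-in for cached MLP inputs
mlp_out = torch.randn(32, 768)            # stand-in for cached MLP outputs
pred, feats = tc(mlp_in)
# Reconstruct the MLP's *output*, with an L1 penalty for sparse features.
loss = (pred - mlp_out).pow(2).mean() + 1e-3 * feats.abs().mean()
```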