In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads …
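To make the incrementing behavior concrete, here is a minimal sketch of a succession test, assuming (as a common simplification) that a head's token-level effect can be summarized by composing the embedding matrix, the head's OV circuit, and the unembedding. All weights below are random placeholders standing in for a trained model, and the ordinal token ids are hypothetical.

```python
# Minimal succession test. In a real model, W_E / W_U are the (un)embedding
# matrices and W_OV is the head's combined output-value circuit; here they
# are random placeholders, so the score lands near chance.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 64, 12

W_E = rng.normal(size=(d_vocab, d_model))   # token embeddings (placeholder)
W_OV = rng.normal(size=(d_model, d_model))  # head OV circuit (placeholder)
W_U = rng.normal(size=(d_model, d_vocab))   # unembedding (placeholder)

# Hypothetical token ids for an ordered sequence, e.g. "January".."December".
months = list(range(12))

def succession_score(sequence):
    """Fraction of tokens whose top logit through the OV circuit is their successor."""
    hits = 0
    for cur, nxt in zip(sequence, sequence[1:]):
        logits = W_E[cur] @ W_OV @ W_U      # push one token's embedding through the head
        hits += int(np.argmax(logits) == nxt)
    return hits / (len(sequence) - 1)

# Random weights score near chance (~1/12); a genuine successor head in a
# trained model ("Monday" -> "Tuesday") would score close to 1.0.
print(f"succession score: {succession_score(months):.2f}")
```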
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing …
Information flows through the network along routes implemented by mechanisms inside the model. These routes can be represented as graphs where nodes correspond to token …
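A minimal sketch of that graph representation, using hypothetical (layer, position) nodes and made-up edge importances; a real pipeline would derive the edge weights by attributing each attention and MLP computation. networkx is used here only for the graph structure.

```python
# Sketch of a computation graph over token representations. Nodes are
# (layer, position) pairs; edge weights are made-up importance scores
# standing in for measured contributions of attention/MLP computations.
import networkx as nx

G = nx.DiGraph()
edges = [
    (("layer0", 0), ("layer1", 2), 0.7),  # attention moving info from pos 0 to pos 2
    (("layer0", 2), ("layer1", 2), 0.2),  # residual-stream (skip) contribution
    (("layer1", 2), ("layer2", 2), 0.9),  # MLP computation at pos 2
]
for src, dst, w in edges:
    G.add_edge(src, dst, weight=w)

# Thresholding edge importance keeps only the dominant information-flow route.
route = [(u, v, d["weight"]) for u, v, d in G.edges(data=True) if d["weight"] >= 0.5]
print("dominant route:", route)
```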
Z He, X Ge, Q Tang, T Sun, Q Cheng, X Qiu - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse dictionary learning is a rapidly growing technique in mechanistic interpretability for attacking superposition and extracting more human-understandable features …
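A minimal sparse-autoencoder sketch of the kind this line of work trains on model activations: a wide ReLU dictionary with an L1 sparsity penalty. The shapes, data, and sparsity coefficient below are illustrative assumptions, not values from the cited papers.

```python
# Minimal sparse autoencoder (SAE): a wide ReLU dictionary trained to
# reconstruct activations under an L1 sparsity penalty. All shapes and
# coefficients are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstruction

    def forward(self, x):
        f = torch.relu(self.encoder(x))            # non-negative, hopefully sparse codes
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
x = torch.randn(32, 512)                           # stand-in for residual-stream activations
x_hat, f = sae(x)
l1_coeff = 1e-3                                    # assumed sparsity/reconstruction trade-off
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(-1).mean()
loss.backward()                                    # gradients for one illustrative step
print(f"reconstruction + sparsity loss: {loss.item():.3f}")
```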
J Miller, B Chughtai, W Saunders - First Conference on Language …, 2024 - openreview.net
Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' …
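Candidate circuits discovered this way are typically scored by how much of the model's task performance they recover. Below is a toy sketch of one common normalized formulation from this literature, with made-up numbers; F stands for any task metric, such as a mean logit difference.

```python
# Toy version of a common normalized faithfulness score. f_model, f_circuit,
# and f_baseline are hypothetical values of a task metric F for the full
# model, the candidate circuit with everything else ablated, and a fully
# ablated baseline, respectively.
def faithfulness(f_circuit: float, f_model: float, f_baseline: float) -> float:
    """1.0 means the circuit fully recovers the model's task performance."""
    return (f_circuit - f_baseline) / (f_model - f_baseline)

f_model, f_circuit, f_baseline = 3.2, 2.9, 0.1  # made-up logit differences
print(f"faithfulness: {faithfulness(f_circuit, f_model, f_baseline):.2f}")  # ~0.90
```

Scores of this kind depend heavily on how the non-circuit components are ablated (zero, mean, or resample ablation), which is why faithfulness numbers from different experimental setups are hard to compare.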
X Ge, W Shu, J Wang, F Zhu, Z He, X Qiu - openreview.net
We present a novel approach to Transformer circuit analysis using Sparse Autoencoders (SAEs) and Transcoders. SAEs allow fine-grained feature extraction from model activations …
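To contrast with the SAE sketch above: a transcoder is trained not to reconstruct an activation from itself, but to map a layer's input to that layer's output through a sparse feature layer, so the layer's computation itself becomes decomposable. The stand-in MLP, shapes, and coefficients below are illustrative assumptions.

```python
# Minimal transcoder sketch: an SAE-like module trained to approximate an
# MLP's input-to-output map through a sparse bottleneck. All shapes, data,
# and coefficients are illustrative.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse features of the MLP's input
        return self.decoder(f), f         # approximation of the MLP's output

mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))  # stand-in MLP
tc = Transcoder(d_model=512, d_dict=8192)

x = torch.randn(64, 512)                  # stand-in pre-MLP activations
with torch.no_grad():
    y = mlp(x)                            # target: the MLP's actual output
y_hat, f = tc(x)
loss = ((y - y_hat) ** 2).mean() + 1e-3 * f.abs().sum(-1).mean()
loss.backward()
print(f"transcoder loss: {loss.item():.3f}")
```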