Copy suppression: Comprehensively understanding an attention head

C McDougall, A Conmy, C Rushing, T McGrath… - arXiv preprint arXiv …, 2023 - arxiv.org
We present a single attention head in GPT-2 Small that has one main role across the entire
training distribution. If components in earlier layers predict a certain token, and this token …
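The mechanism lends itself to a quick empirical check. Below is a minimal sketch, assuming the TransformerLens library: the head location (L10H7 in GPT-2 Small) and the "All's fair in love and" prompt are taken from the paper, while zero-ablation is an illustrative simplification (mean-ablation would be more faithful).

```python
# Minimal sketch of probing the copy-suppression head, assuming TransformerLens.
# L10H7 and the prompt come from the paper; zero-ablation is a simplification.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # L10H7, the copy-suppression head in GPT-2 Small

tokens = model.to_tokens("All's fair in love and")
love_id = model.to_single_token(" love")  # naive copying would repeat " love"

def zero_head(z, hook):
    z[:, :, HEAD, :] = 0.0  # knock out this head's output at every position
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)]
)
print("logit(' love') with head:   ", clean_logits[0, -1, love_id].item())
print("logit(' love') head ablated:", ablated_logits[0, -1, love_id].item())
# If the head suppresses the copied token, the ablated logit should be higher.
```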

Successor heads: Recurring, interpretable attention heads in the wild

R Gould, E Ong, G Ogden, A Conmy - arXiv preprint arXiv:2312.09230, 2023 - arxiv.org
In this work we present successor heads: attention heads that increment tokens with a
natural ordering, such as numbers, months, and days. For example, successor heads …
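A rough way to test a candidate successor head is to push token embeddings through its OV circuit and check whether each item's successor gets a high logit. A minimal sketch assuming TransformerLens; the head index is a placeholder, and using the raw embedding W_E simplifies the richer effective embedding the paper works with.

```python
# Sketch of a successorship test for one head, assuming TransformerLens.
# The head index is hypothetical; the paper locates such heads empirically.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 9, 1  # placeholder head to test

# Assumes each day name is a single GPT-2 token with a leading space.
days = [" Monday", " Tuesday", " Wednesday", " Thursday", " Friday"]
ids = torch.tensor([model.to_single_token(d) for d in days])

# Effective OV circuit for this head: W_E -> W_V -> W_O -> W_U
W_V = model.W_V[LAYER, HEAD]          # [d_model, d_head]
W_O = model.W_O[LAYER, HEAD]          # [d_head, d_model]
emb = model.W_E[ids]                  # [5, d_model]
logits = emb @ W_V @ W_O @ model.W_U  # [5, d_vocab]

# A successor head should rank each day's successor near the top.
for i, d in enumerate(days[:-1]):
    rank = (logits[i] > logits[i, ids[i + 1]]).sum().item()
    print(f"{d} -> {days[i + 1]}: successor logit rank {rank}")
```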

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org
Automated interpretability research has recently attracted attention as a potential research
direction that could scale explanations of neural network behavior to large models. Existing …
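Attribution patching replaces a full activation-patching sweep with a first-order estimate: delta_metric ~= (a_corrupt - a_clean) . d(metric)/d(a_clean). A minimal sketch assuming TransformerLens; the prompts, metric, and hook point are illustrative, and the two prompts must tokenize to the same length.

```python
# Minimal sketch of attribution patching at one hook point, assuming
# TransformerLens; prompts, metric, and hook location are illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.attn.hook_z"

clean = model.to_tokens("The capital of France is")
corrupt = model.to_tokens("The capital of Poland is")  # same token length

def metric(logits):
    return logits[0, -1, model.to_single_token(" Paris")]

_, corrupt_cache = model.run_with_cache(corrupt)

clean_acts = {}
def save_act(act, hook):
    act.retain_grad()              # keep the gradient on this non-leaf tensor
    clean_acts[hook.name] = act

logits = model.run_with_hooks(clean, fwd_hooks=[(hook_name, save_act)])
metric(logits).backward()

a = clean_acts[hook_name]
# First-order estimate of what activation patching would measure:
# delta_metric ~= (a_corrupt - a_clean) . grad_a metric
estimate = ((corrupt_cache[hook_name] - a.detach()) * a.grad).sum()
print("attribution-patching estimate:", estimate.item())
```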

Information flow routes: Automatically interpreting language models at scale

J Ferrando, E Voita - arXiv preprint arXiv:2403.00824, 2024 - arxiv.org
Information flows along routes inside the network via mechanisms implemented in the model.
These routes can be represented as graphs where nodes correspond to token …
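The graph view suggests a simple approximation one can compute directly: score each attention head's write into a token's residual stream by its relative norm and keep the strong edges. The sketch below, assuming TransformerLens, uses a norm ratio with an arbitrary threshold, which simplifies the paper's importance measure.

```python
# Sketch in the spirit of information-flow routes, assuming TransformerLens:
# keep head->token edges whose contribution norm passes a threshold.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)

pos = tokens.shape[1] - 1  # inspect routes into the final position
edges = []
for layer in range(model.cfg.n_layers):
    z = cache["z", layer][0, pos]              # [n_heads, d_head]
    W_O = model.W_O[layer]                     # [n_heads, d_head, d_model]
    head_out = torch.einsum("hd,hdm->hm", z, W_O)
    resid = cache["resid_post", layer][0, pos]
    score = head_out.norm(dim=-1) / resid.norm()
    for h in range(model.cfg.n_heads):
        if score[h] > 0.1:                     # arbitrary threshold
            edges.append((f"L{layer}H{h}", f"pos{pos}", score[h].item()))

for edge in sorted(edges, key=lambda t: -t[2])[:10]:
    print(edge)
```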

Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT

Z He, X Ge, Q Tang, T Sun, Q Cheng, X Qiu - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse dictionary learning has been a rapidly growing technique in mechanistic
interpretability to attack superposition and extract more human-understandable features …
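For reference, the core object here is a sparse autoencoder trained to reconstruct cached activations under an L1 sparsity penalty. A minimal PyTorch sketch; the dictionary size and L1 coefficient are illustrative, not the paper's settings.

```python
# Minimal sparse autoencoder of the kind used for dictionary learning in
# mechanistic interpretability; sizes and the L1 coefficient are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

acts = torch.randn(64, 512)           # stand-in for cached model activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward(); opt.step(); opt.zero_grad()
```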

Transformer circuit evaluation metrics are not robust

J Miller, B Chughtai, W Saunders - First Conference on Language …, 2024 - openreview.net
Mechanistic interpretability work attempts to reverse engineer the learned algorithms
present inside neural networks. One focus of this work has been to discover 'circuits' …
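The robustness question arises because different faithfulness metrics can rank the same ablated circuit differently. Below are two metrics commonly used in this literature, logit difference and KL divergence to the full model, in a minimal PyTorch sketch with stand-in logits; the numbers it prints are purely illustrative.

```python
# Two common circuit-faithfulness metrics computed on the same logits.
import torch
import torch.nn.functional as F

def logit_diff(logits, correct_id, wrong_id):
    # Margin between the correct and a contrast token, averaged over a batch.
    return (logits[..., correct_id] - logits[..., wrong_id]).mean()

def kl_to_full_model(circuit_logits, full_logits):
    # KL(full || circuit) over the output distribution, per example.
    return F.kl_div(
        F.log_softmax(circuit_logits, dim=-1),
        F.log_softmax(full_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )

full = torch.randn(8, 50257)                    # stand-in full-model logits
circuit = full + 0.5 * torch.randn_like(full)   # stand-in ablated-circuit logits
print(logit_diff(circuit, 0, 1).item(), kl_to_full_model(circuit, full).item())
```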

Automatically Identifying and Interpreting Sparse Circuits with Hierarchical Tracing

X Ge, W Shu, J Wang, F Zhu, Z He, X Qiu - openreview.net
We present a novel approach to Transformer circuit analysis using Sparse Autoencoders
(SAEs) and Transcoders. SAEs allow fine-grained feature extraction from model activations …
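A transcoder, the second ingredient named here, is a sparse module trained to imitate an MLP layer's input-to-output map rather than to reconstruct its input as an SAE does. A minimal PyTorch sketch; the architecture and sizes are a generic sketch, not the paper's exact setup.

```python
# Minimal transcoder: a sparse module imitating an MLP's input->output map,
# used alongside SAEs for circuit tracing. Sizes are illustrative.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, mlp_in):
        f = torch.relu(self.enc(mlp_in))  # sparse feature activations
        return self.dec(f), f

tc = Transcoder(d_model=768, d_dict=8192)
mlp_in = torch.randn(32, 768)             # stand-in for cached MLP inputs
mlp_out = torch.randn(32, 768)            # stand-in for cached MLP outputs
pred, feats = tc(mlp_in)
# Reconstruct the MLP's *output*, with an L1 penalty for sparse features.
loss = (pred - mlp_out).pow(2).mean() + 1e-3 * feats.abs().mean()
```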