Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
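
A minimal sketch of the k-sparse probing setup this abstract alludes to, on synthetic data: the activation matrix, labels, and the `informative` neuron indices are all hypothetical stand-ins for real LLM activations, and ranking neurons by class-mean difference is just one simple selection heuristic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "activations": (examples, neurons) with a binary feature label.
# A few neurons are made informative by construction (hypothetical stand-ins
# for feature-carrying neurons in a real LLM).
rng = np.random.default_rng(0)
n, d, k = 2000, 512, 4
y = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
informative = [3, 97, 200]
acts[:, informative] += 2.0 * y[:, None]

# 1. Rank neurons by a cheap univariate score (difference of class means).
scores = np.abs(acts[y == 1].mean(0) - acts[y == 0].mean(0))
top_k = np.argsort(scores)[-k:]

# 2. Fit a probe restricted to those k neurons only.
probe = LogisticRegression(max_iter=1000).fit(acts[:, top_k], y)
print("selected neurons:", sorted(top_k.tolist()))
print("k-sparse probe accuracy:", probe.score(acts[:, top_k], y))
```

The accuracy printed here is on the training set, which suffices for a sketch; a real evaluation would hold out data and sweep k to see how few neurons carry the feature.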

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Mechanistic Interpretability for AI Safety – A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization (identifying the important model components) is a key …
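
A minimal sketch of the patching mechanic itself, assuming a toy MLP in place of a transformer component; the clean/corrupt inputs and the recovery metric are illustrative stand-ins for the prompt pairs and evaluation metrics the paper compares.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

clean_x = torch.randn(1, 8)    # stand-in for the "clean" prompt
corrupt_x = torch.randn(1, 8)  # stand-in for the "corrupted" prompt

# 1. Cache the clean activation at the site of interest (layer 0 output here).
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach().clone()

h = model[0].register_forward_hook(save_hook)
clean_out = model(clean_x)
h.remove()

# 2. Re-run on the corrupted input, overwriting that site with the clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor from a forward hook replaces the output

h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
h.remove()

corrupt_out = model(corrupt_x)

# 3. A patching metric: fraction of the clean-vs-corrupt gap the patch recovers.
recovered = (patched_out - corrupt_out).norm() / (clean_out - corrupt_out).norm()
print(f"fraction of gap recovered: {recovered.item():.3f}")
```

Because this toy model has a single upstream site, patching it recovers the clean output exactly; in a real transformer each head or MLP typically recovers only part of the gap, which is what the metric localizes.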

Towards vision-language mechanistic interpretability: A causal tracing tool for blip

V Palit, R Pandey, A Arora… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Mechanistic interpretability seeks to understand the neural mechanisms that enable specific
behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While …

Interpretability illusions in the generalization of simplified models

D Friedman, AK Lampinen, L Dixon… - … on Machine Learning, 2023 - openreview.net
A common method to study deep learning systems is to use simplified model
representations—for example, using singular value decomposition to visualize the model's …
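
A minimal sketch of the kind of simplification under study, with a random matrix standing in for trained weights: truncate a weight matrix to rank r via SVD and ask how closely the simplified map agrees with the original on some inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for a trained weight matrix

# Rank-r truncated SVD: keep the top-r singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_r = (U[:, :r] * S[:r]) @ Vt[:r]

# Compare the original and simplified maps on stand-in inputs.
x = rng.normal(size=(16, 64))
agreement = np.mean(np.sign(x @ W) == np.sign(x @ W_r))
print(f"rank-{r} reconstruction error: {np.linalg.norm(W - W_r) / np.linalg.norm(W):.3f}")
print(f"sign agreement on these inputs: {agreement:.3f}")
```

The paper's caution is that agreement measured on one input distribution can break down on another, so a simplified model that looks faithful in-distribution may still mislead about the full model's mechanism.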

Successor heads: Recurring, interpretable attention heads in the wild

R Gould, E Ong, G Ogden, A Conmy - arXiv preprint arXiv:2312.09230, 2023 - arxiv.org
In this work we present successor heads: attention heads that increment tokens with a
natural ordering, such as numbers, months, and days. For example, successor heads …
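
A minimal sketch of one way to test for successor behavior, with hypothetical random matrices standing in for a trained model's embedding (W_E), a single head's OV circuit (W_OV), and unembedding (W_U); the paper's actual analysis uses trained weights and accounts for more of the surrounding computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 32
W_E = rng.normal(size=(vocab, d_model))    # hypothetical embedding matrix
W_OV = rng.normal(size=(d_model, d_model)) # hypothetical per-head OV circuit
W_U = rng.normal(size=(d_model, vocab))    # hypothetical unembedding matrix

# Token ids forming a natural ordering (e.g. "1".."7" or Monday..Sunday).
ordered = list(range(10, 17))

# Route each ordered token's embedding through the head's OV circuit and
# unembed: rows are logit contributions over the vocabulary.
scores = W_E[ordered] @ W_OV @ W_U

# A head is "successor-like" if, restricted to the ordered list, the top
# contribution for each token is its successor.
hits = sum(
    scores[i, ordered].argmax() == i + 1
    for i in range(len(ordered) - 1)
)
print(f"succession accuracy within the ordered list: {hits}/{len(ordered) - 1}")
```

With random matrices this sits near chance; the finding in the paper is that specific heads in trained models score far above chance across many ordinal domains.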

Circuit component reuse across tasks in transformer language models

J Merullo, C Eickhoff, E Pavlick - arXiv preprint arXiv:2310.08744, 2023 - arxiv.org
Recent work in mechanistic interpretability has shown that behaviors in language models
can be successfully reverse-engineered through circuit analysis. A common criticism …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …