Not all language model features are linear

J Engels, EJ Michaud, I Liao, W Gurnee… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has proposed that language models perform computation by manipulating one-
dimensional representations of concepts ("features") in activation space. In contrast, we …
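
The contrast drawn here is between purely one-dimensional (linear) feature directions and irreducibly multi-dimensional ones, such as cyclic concepts. Below is a toy sketch of why a cyclic concept is more natural as a two-dimensional subspace than as a single direction; it is illustrative only, not the paper's experiments, and the day-of-week circle is used as a stand-in example.

import numpy as np

# Toy contrast: a cyclic concept (here, day of week) is naturally a 2-D circle in
# activation space, which no single linear direction can represent faithfully.
days = np.arange(7)
angles = 2 * np.pi * days / 7
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 2-D circular representation

# Projecting onto any one direction collapses distinct days onto nearby scalars
# (e.g. day 1 and day 6 get the same cosine value).
direction = np.array([1.0, 0.0])
print("1-D projection:", np.round(circle @ direction, 2))

# The 2-D representation, by contrast, supports "add k days" as a rotation.
k = 3
theta = 2 * np.pi * k / 7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
print("rotation shifts every day forward by 3:",
      np.allclose(circle @ rot.T, np.roll(circle, -k, axis=0)))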

Interpreting attention layer outputs with sparse autoencoders

C Kissane, R Krzyzanowski, JI Bloom, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Decomposing model activations into interpretable components is a key open problem in
mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
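
The mechanism the snippet alludes to, an encoder producing sparse latent codes and a decoder reconstructing the activation from them, can be sketched as follows. This is a minimal illustration with assumed layer sizes and a TopK activation constraint, not the paper's implementation.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)  # activation -> latent pre-activations
        self.dec = nn.Linear(d_dict, d_model)  # sparse codes -> reconstruction
        self.k = k                             # number of latents allowed to fire

    def forward(self, x):
        z = torch.relu(self.enc(x))
        # Keep only the k largest latents per example; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse

# Placeholder sizes: reconstruct a batch of model activations.
sae = SparseAutoencoder(d_model=768, d_dict=16384, k=32)
x = torch.randn(8, 768)
x_hat, codes = sae(x)
recon_loss = ((x - x_hat) ** 2).mean()         # training objective (reconstruction)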

Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small

M Chaudhary, A Geiger - arXiv preprint arXiv:2409.04478, 2024 - arxiv.org
A popular new method in mechanistic interpretability is to train high-dimensional sparse
autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of …

Mathematical models of computation in superposition

K Hänni, J Mendel, D Vaintrob, L Chan - arXiv preprint arXiv:2408.05451, 2024 - arxiv.org
Superposition--when a neural network represents more "features" than it has dimensions--
seems to pose a serious challenge to mechanistically interpreting current AI systems …
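
A small numerical illustration of the setting (all sizes arbitrary): far more nearly orthogonal feature directions than dimensions can coexist, and sparse combinations of them remain approximately decodable, which is what makes computation in superposition conceivable at all.

import numpy as np

rng = np.random.default_rng(0)
n_features, d = 2000, 200                       # 10x more "features" than dimensions
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm feature directions

# Random directions interfere only weakly: pairwise dot products have
# magnitude on the order of 1/sqrt(d), here roughly 0.07.
gram = W @ W.T
print("mean |interference|:", np.abs(gram[~np.eye(n_features, dtype=bool)]).mean())

# A sparse combination of features can still be decoded approximately.
active = rng.choice(n_features, size=5, replace=False)
x = W[active].sum(axis=0)                       # superposed representation
recovered = np.argsort(-(W @ x))[:5]            # top-scoring feature directions
print("recovered the active set:", set(recovered) == set(active))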

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

K Ayonrinde, MT Pearce, L Sharkey - arXiv preprint arXiv:2410.11179, 2024 - arxiv.org
Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal
representations of neural networks. However, naively optimising SAEs for reconstruction …
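
As a rough sketch of the compression framing (the bit-cost below is an illustrative proxy, not necessarily the criterion used in the paper): an SAE can be scored by the bits needed to transmit its sparse codes, which penalises both too many active latents and an overly wide dictionary.

import numpy as np

def description_length_bits(codes, bits_per_value=8):
    """codes: (batch, d_dict) array of sparse latent activations.
    Assumed cost model: bits to name which latents fire + bits per nonzero value."""
    batch, d_dict = codes.shape
    n_active = np.count_nonzero(codes, axis=1)
    index_bits = n_active * np.log2(d_dict)      # which latents fired
    value_bits = n_active * bits_per_value       # their magnitudes at fixed precision
    return float((index_bits + value_bits).mean())

codes = np.zeros((4, 16384))
codes[:, :32] = np.random.rand(4, 32)            # 32 active latents per example
print(description_length_bits(codes))            # 32 * (log2(16384) + 8) = 704 bits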

Investigating sensitive directions in GPT-2: An improved baseline and comparative analysis of SAEs

DJ Lee, S Heimersheim - arXiv preprint arXiv:2410.12555, 2024 - arxiv.org
Sensitive directions experiments attempt to understand the computational features of
Language Models (LMs) by measuring how much the next token prediction probabilities …
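
The measurement described can be sketched as: perturb an activation along a candidate direction and quantify how far the next-token distribution moves, e.g. with a KL divergence. The unembedding matrix and sizes below are placeholder stand-ins for a real LM, whose intermediate activations would actually be perturbed.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
d_model, vocab = 768, 1000                      # small vocab for illustration
unembed = 0.02 * rng.standard_normal((d_model, vocab))   # stand-in for the LM head

activation = rng.standard_normal(d_model)       # baseline activation at some layer
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)          # candidate feature direction

p_base = softmax(activation @ unembed)
for scale in (0.5, 1.0, 2.0, 4.0):              # perturbation magnitudes
    p_pert = softmax((activation + scale * direction) @ unembed)
    print(f"scale={scale}: KL(base || perturbed) = {kl_div(p_base, p_pert):.4f}")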

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

D Braun, J Taylor, N Goldowsky-Dill… - arXiv preprint arXiv …, 2024 - arxiv.org
Identifying the features learned by neural networks is a core challenge in mechanistic
interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary …
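
The distinguishing idea hinted at here is to train the dictionary against the model's downstream behaviour rather than against raw reconstruction error alone. Below is a schematic version of such a loss, with a toy linear head standing in for the rest of the (frozen) network; it is not the authors' exact training setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_dict, vocab = 64, 512, 100            # toy sizes
sae_enc = nn.Linear(d_model, d_dict)
sae_dec = nn.Linear(d_dict, d_model)
downstream = nn.Linear(d_model, vocab)           # stand-in for the rest of the model
downstream.requires_grad_(False)                 # the base model stays frozen

x = torch.randn(16, d_model)                     # activations at the hooked layer
z = torch.relu(sae_enc(x))                       # latent codes (sparsity via L1 below)
x_hat = sae_dec(z)                               # SAE reconstruction

logits_orig = downstream(x)                      # original behaviour
logits_sae = downstream(x_hat)                   # behaviour with the SAE spliced in

# Functional objective: match the output distribution, plus a sparsity penalty.
kl = F.kl_div(F.log_softmax(logits_sae, dim=-1),
              F.log_softmax(logits_orig, dim=-1),
              log_target=True, reduction="batchmean")
loss = kl + 1e-3 * z.abs().sum(dim=-1).mean()
loss.backward()                                  # gradients flow only to the SAE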

Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

S Boughorbel, MD Parvez, M Hawasly - arXiv preprint arXiv:2405.14277, 2024 - arxiv.org
Training LLMs in low-resource languages usually utilizes data augmentation with machine
translation (MT) from English. However, translation brings a number of challenges …