Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
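The snippet describes SAEs as learning a sparse decomposition of a network's latent representations. As a rough illustration of that idea (not the Gemma Scope architecture itself; the layer widths, ReLU encoder, and L1 coefficient below are illustrative assumptions), a minimal sketch might look like:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: re-express activations in a wide, sparse feature basis."""
    def __init__(self, d_model: int = 768, d_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly-zero codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations toward sparsity.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()
```

Trained this way on residual-stream or MLP activations, each learned feature direction is a candidate interpretable unit; the specific training recipe and sparsity mechanism vary across the papers listed here.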

Measuring progress in dictionary learning for language model interpretability with board game models

A Karvonen, B Wright, C Rager, R Angell… - arXiv preprint arXiv …, 2024 - arxiv.org
What latent features are encoded in language model (LM) representations? Recent work on
training sparse autoencoders (SAEs) to disentangle interpretable features in LM …

Mechanistic?

N Saphra, S Wiegreffe - arXiv preprint arXiv:2410.09087, 2024 - arxiv.org
The rise of the term "mechanistic interpretability" has accompanied increasing interest in
understanding neural models, particularly language models. However, this jargon has also …

Inferring Functionality of Attention Heads from their Parameters

A Elhelo, M Geva - arXiv preprint arXiv:2412.11965, 2024 - arxiv.org
Attention heads are one of the building blocks of large language models (LLMs). Prior work
on investigating their operation mostly focused on analyzing their behavior during inference …

softmax is not enough (for sharp out-of-distribution)

P Veličković, C Perivolaropoulos, F Barbero… - arXiv preprint arXiv …, 2024 - arxiv.org
A key property of reasoning systems is the ability to make sharp decisions on their input
data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function …
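The snippet frames softmax as the carrier of "sharp" (near-argmax) behaviour. A small numeric sketch, with illustrative values rather than the paper's experiments, shows why bounded logits make softmax disperse as the number of competing items grows, which is the sense in which sharpness breaks down out of distribution:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One "relevant" item with a fixed logit advantage among n-1 irrelevant items:
# as n grows, the winning weight shrinks, so the output cannot stay sharp.
for n in (16, 256, 4096):
    logits = np.zeros(n)
    logits[0] = 5.0          # illustrative fixed advantage for the relevant item
    print(n, round(softmax(logits)[0], 3))
# 16 -> ~0.91, 256 -> ~0.37, 4096 -> ~0.035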

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

G Minegishi, H Furuta, Y Iwasawa, Y Matsuo - arXiv preprint arXiv …, 2025 - arxiv.org
Sparse autoencoders (SAEs) have attracted considerable attention as a promising tool for improving
the interpretability of large language models (LLMs) by mapping the complex superposition …

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

B Cywiński, K Deja - arXiv preprint arXiv:2501.18052, 2025 - arxiv.org
Recent machine unlearning approaches offer a promising solution for removing unwanted
concepts from diffusion models. However, traditional methods, which largely rely on fine …

Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning?

M Ye, T Kuribayashi, G Kobayashi, J Suzuki - arXiv preprint arXiv …, 2024 - arxiv.org
Elucidating the rationale behind neural models' outputs has long been a challenge in
machine learning, and it is all the more relevant in this age of large language models …