Not all language model features are linear

J Engels, EJ Michaud, I Liao, W Gurnee… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has proposed that language models perform computation by manipulating one-
dimensional representations of concepts ("features") in activation space. In contrast, we …
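
The contrast drawn here is between purely one-dimensional (linear) feature directions and irreducibly multi-dimensional ones, such as cyclic concepts. Below is a toy sketch of why a cyclic concept is more natural as a two-dimensional subspace than as a single direction; it is illustrative only, not the paper's experiments, and the day-of-week circle is used as a stand-in example.

import numpy as np

# Toy contrast: a cyclic concept (here, day of week) is naturally a 2-D circle in
# activation space, which no single linear direction can represent faithfully.
days = np.arange(7)
angles = 2 * np.pi * days / 7
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 2-D circular representation

# Projecting onto any one direction collapses distinct days onto nearby scalars
# (e.g. day 1 and day 6 get the same cosine value).
direction = np.array([1.0, 0.0])
print("1-D projection:", np.round(circle @ direction, 2))

# The 2-D representation, by contrast, supports "add k days" as a rotation.
k = 3
theta = 2 * np.pi * k / 7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
print("rotation shifts every day forward by 3:",
      np.allclose(circle @ rot.T, np.roll(circle, -k, axis=0)))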

Interpreting attention layer outputs with sparse autoencoders

C Kissane, R Krzyzanowski, JI Bloom, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Decomposing model activations into interpretable components is a key open problem in
mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
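
The mechanism the snippet alludes to, an encoder producing sparse latent codes and a decoder reconstructing the activation from them, can be sketched as follows. This is a minimal illustration with assumed layer sizes and a TopK activation constraint, not the paper's implementation.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)  # activation -> latent pre-activations
        self.dec = nn.Linear(d_dict, d_model)  # sparse codes -> reconstruction
        self.k = k                             # number of latents allowed to fire

    def forward(self, x):
        z = torch.relu(self.enc(x))
        # Keep only the k largest latents per example; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse

# Placeholder sizes: reconstruct a batch of model activations.
sae = SparseAutoencoder(d_model=768, d_dict=16384, k=32)
x = torch.randn(8, 768)
x_hat, codes = sae(x)
recon_loss = ((x - x_hat) ** 2).mean()         # training objective (reconstruction)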

Evaluating open-source sparse autoencoders on disentangling factual knowledge in GPT-2 small

M Chaudhary, A Geiger - arXiv preprint arXiv:2409.04478, 2024 - arxiv.org
A popular new method in mechanistic interpretability is to train high-dimensional sparse
autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of …

Mathematical models of computation in superposition

K Hänni, J Mendel, D Vaintrob, L Chan - arXiv preprint arXiv:2408.05451, 2024 - arxiv.org
Superposition--when a neural network represents more "features" than it has dimensions--
seems to pose a serious challenge to mechanistically interpreting current AI systems …
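
A small numerical illustration of the setting (all sizes arbitrary): far more nearly orthogonal feature directions than dimensions can coexist, and sparse combinations of them remain approximately decodable, which is what makes computation in superposition conceivable at all.

import numpy as np

rng = np.random.default_rng(0)
n_features, d = 2000, 200                       # 10x more "features" than dimensions
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm feature directions

# Random directions interfere only weakly: pairwise dot products have
# magnitude on the order of 1/sqrt(d), here roughly 0.07.
gram = W @ W.T
print("mean |interference|:", np.abs(gram[~np.eye(n_features, dtype=bool)]).mean())

# A sparse combination of features can still be decoded approximately.
active = rng.choice(n_features, size=5, replace=False)
x = W[active].sum(axis=0)                       # superposed representation
recovered = np.argsort(-(W @ x))[:5]            # top-scoring feature directions
print("recovered the active set:", set(recovered) == set(active))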

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

K Ayonrinde, MT Pearce, L Sharkey - arXiv preprint arXiv:2410.11179, 2024 - arxiv.org
Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal
representations of neural networks. However, naively optimising SAEs for reconstruction …
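
As a rough sketch of the compression framing (the bit-cost below is an illustrative proxy, not necessarily the criterion used in the paper): an SAE can be scored by the bits needed to transmit its sparse codes, which penalises both too many active latents and an overly wide dictionary.

import numpy as np

def description_length_bits(codes, bits_per_value=8):
    """codes: (batch, d_dict) array of sparse latent activations.
    Assumed cost model: bits to name which latents fire + bits per nonzero value."""
    batch, d_dict = codes.shape
    n_active = np.count_nonzero(codes, axis=1)
    index_bits = n_active * np.log2(d_dict)      # which latents fired
    value_bits = n_active * bits_per_value       # their magnitudes at fixed precision
    return float((index_bits + value_bits).mean())

codes = np.zeros((4, 16384))
codes[:, :32] = np.random.rand(4, 32)            # 32 active latents per example
print(description_length_bits(codes))            # 32 * (log2(16384) + 8) = 704 bits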

Investigating sensitive directions in GPT-2: An improved baseline and comparative analysis of SAEs

DJ Lee, S Heimersheim - arXiv preprint arXiv:2410.12555, 2024 - arxiv.org
Sensitive directions experiments attempt to understand the computational features of
Language Models (LMs) by measuring how much the next token prediction probabilities …
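
The measurement described can be sketched as: perturb an activation along a candidate direction and quantify how far the next-token distribution moves, e.g. with a KL divergence. The unembedding matrix and sizes below are placeholder stand-ins for a real LM, whose intermediate activations would actually be perturbed.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
d_model, vocab = 768, 1000                      # small vocab for illustration
unembed = 0.02 * rng.standard_normal((d_model, vocab))   # stand-in for the LM head

activation = rng.standard_normal(d_model)       # baseline activation at some layer
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)          # candidate feature direction

p_base = softmax(activation @ unembed)
for scale in (0.5, 1.0, 2.0, 4.0):              # perturbation magnitudes
    p_pert = softmax((activation + scale * direction) @ unembed)
    print(f"scale={scale}: KL(base || perturbed) = {kl_div(p_base, p_pert):.4f}")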

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

D Braun, J Taylor, N Goldowsky-Dill… - arXiv preprint arXiv …, 2024 - arxiv.org
Identifying the features learned by neural networks is a core challenge in mechanistic
interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary …
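
The distinguishing idea hinted at here is to train the dictionary against the model's downstream behaviour rather than against raw reconstruction error alone. Below is a schematic version of such a loss, with a toy linear head standing in for the rest of the (frozen) network; it is not the authors' exact training setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_dict, vocab = 64, 512, 100            # toy sizes
sae_enc = nn.Linear(d_model, d_dict)
sae_dec = nn.Linear(d_dict, d_model)
downstream = nn.Linear(d_model, vocab)           # stand-in for the rest of the model
downstream.requires_grad_(False)                 # the base model stays frozen

x = torch.randn(16, d_model)                     # activations at the hooked layer
z = torch.relu(sae_enc(x))                       # latent codes (sparsity via L1 below)
x_hat = sae_dec(z)                               # SAE reconstruction

logits_orig = downstream(x)                      # original behaviour
logits_sae = downstream(x_hat)                   # behaviour with the SAE spliced in

# Functional objective: match the output distribution, plus a sparsity penalty.
kl = F.kl_div(F.log_softmax(logits_sae, dim=-1),
              F.log_softmax(logits_orig, dim=-1),
              log_target=True, reduction="batchmean")
loss = kl + 1e-3 * z.abs().sum(dim=-1).mean()
loss.backward()                                  # gradients flow only to the SAE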

Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

S Boughorbel, MD Parvez, M Hawasly - arXiv preprint arXiv:2405.14277, 2024 - arxiv.org
Training LLMs in low-resource languages usually utilizes data augmentation with machine
translation (MT) from English. However, translation brings a number of challenges …