Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
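
A minimal sketch of the common object these results study may help: an SAE maps an activation vector to a wide, sparse feature vector and reconstructs the activation from it. The ReLU-plus-L1 variant below is the generic textbook form with illustrative dimensions, not Gemma Scope's own architecture (the paper trains JumpReLU SAEs):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic ReLU SAE: decompose activations into sparse, nameable features."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec                         # reconstruction
        return x_hat, f

sae = SparseAutoencoder(d_model=1024, d_sae=8192)   # toy sizes; real suites are far wider
x = torch.randn(8, 1024)                            # stand-in residual-stream activations
x_hat, f = sae(x)
loss = ((x - x_hat) ** 2).mean() + 3e-4 * f.abs().sum(-1).mean()  # reconstruction + L1 sparsity
```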

Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders

Z He, W Shu, X Ge, L Chen, J Wang, Y Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for
extracting sparse representations from language models, yet scalable training remains a …
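
On the scalability point this abstract raises: one widely used way to avoid tuning an L1 coefficient per run is a TopK activation, which caps the number of active features by construction. A sketch under that assumption (shapes and names are illustrative; check the paper for Llama Scope's exact architecture):

```python
import torch

def topk_encode(x: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest feature pre-activations per input; zero the rest."""
    pre = x @ W_enc + b_enc                  # (batch, d_sae) pre-activations
    vals, idx = torch.topk(pre, k, dim=-1)   # k strongest features per input
    f = torch.zeros_like(pre)
    f.scatter_(-1, idx, torch.relu(vals))    # at most k active features, by construction
    return f

x = torch.randn(8, 1024)                                  # toy stand-in activations
W_enc, b_enc = torch.randn(1024, 8192) * 0.01, torch.zeros(8192)
f = topk_encode(x, W_enc, b_enc, k=64)
```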

Obfuscated Activations Bypass LLM Latent-Space Defenses

L Bailey, A Serrano, A Sheshadri, M Seleznyov… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent latent-space monitoring techniques have shown promise as defenses against LLM
attacks. These defenses act as scanners that seek to detect harmful activations before they …
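
To make the threat model concrete: a latent-space defense of the kind described here can be as simple as a probe over hidden activations plus a rejection threshold. The monitor below is a hypothetical minimal example, not any specific defense evaluated in the paper; obfuscated-activation attacks optimize inputs so the harmful behavior survives while the score stays under the threshold:

```python
import torch

def latent_monitor(acts: torch.Tensor, probe_w: torch.Tensor, probe_b: torch.Tensor,
                   threshold: float = 0.5) -> torch.Tensor:
    """Flag inputs whose hidden activations a linear probe scores as harmful."""
    scores = torch.sigmoid(acts @ probe_w + probe_b)  # per-example harmfulness score
    return scores > threshold                         # True = block or escalate

acts = torch.randn(4, 1024)                # stand-in hidden states, one row per prompt
probe_w, probe_b = torch.randn(1024), torch.tensor(0.0)
flagged = latent_monitor(acts, probe_w, probe_b)
```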

Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

D Ghilardi, F Belotti, M Molinari - arXiv preprint arXiv:2410.21508, 2024 - arxiv.org
Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach
for understanding the inner workings of Large Language Models (LLMs). They reconstruct …
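
The idea named in the title, amortizing training by sharing one SAE across a group of similar layers, can be sketched roughly as below. The grouping criterion used here (greedy merging on cosine similarity of mean activations) is an assumption for illustration and may differ from the paper's clustering method:

```python
import torch
import torch.nn.functional as F

def group_layers_by_similarity(layer_acts: list[torch.Tensor], n_groups: int) -> list[list[int]]:
    """Cluster layers whose activations look alike, so one SAE can serve each group."""
    means = torch.stack([a.mean(dim=0) for a in layer_acts])             # (n_layers, d_model)
    sims = F.cosine_similarity(means.unsqueeze(1), means.unsqueeze(0), dim=-1)
    groups = [[i] for i in range(len(layer_acts))]
    while len(groups) > n_groups:
        # Greedily merge the pair of groups containing the most similar layers.
        i, j = max(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: max(sims[p, q] for p in groups[ab[0]] for q in groups[ab[1]]),
        )
        groups[i] += groups.pop(j)
    return groups  # train a single SAE on pooled activations from each group

layer_acts = [torch.randn(256, 512) for _ in range(12)]  # toy activations for 12 layers
print(group_layers_by_similarity(layer_acts, n_groups=3))
```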
