What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM …
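For orientation, a minimal sketch of the generic sparse-autoencoder setup such work builds on, assuming a standard ReLU encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty on LM activations; the layer sizes and penalty weight below are illustrative and not taken from the cited paper.

    # Minimal sparse autoencoder over LM activations (illustrative; hyperparameters assumed).
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=768, d_hidden=8 * 768):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            # ReLU keeps feature activations non-negative; the L1 term below pushes them toward sparsity.
            features = torch.relu(self.encoder(x))
            reconstruction = self.decoder(features)
            return reconstruction, features

    def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
        # Reconstruction error plus a sparsity penalty on the feature activations.
        return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()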
The rise of the term" mechanistic interpretability" has accompanied increasing interest in understanding neural models--particularly language models. However, this jargon has also …
A Elhelo, M Geva - arXiv preprint arXiv:2412.11965, 2024 - arxiv.org
Attention heads are one of the building blocks of large language models (LLMs). Prior work investigating their operation has mostly focused on analyzing their behavior during inference …
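For reference, a single attention head in the standard scaled dot-product formulation, written as a small sketch; the shapes and naming are illustrative and this is not the analysis method of the cited paper.

    # One attention head: standard scaled dot-product attention (illustrative shapes).
    import math
    import torch

    def attention_head(x, W_q, W_k, W_v):
        # x: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = (q @ k.T) / math.sqrt(k.shape[-1])   # (seq_len, seq_len)
        weights = torch.softmax(scores, dim=-1)       # each row is a distribution over positions
        return weights @ v                            # (seq_len, d_head)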
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function …
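The softmax in question is the standard one; a small sketch, with illustrative values, of how a temperature parameter controls how "sharp" its output distribution becomes.

    # Softmax with temperature: lower temperature gives a sharper (more decisive) distribution.
    import torch

    def softmax_with_temperature(logits, temperature=1.0):
        return torch.softmax(logits / temperature, dim=-1)

    logits = torch.tensor([2.0, 1.0, 0.5])
    print(softmax_with_temperature(logits, temperature=1.0))   # relatively smooth
    print(softmax_with_temperature(logits, temperature=0.1))   # close to one-hot (sharp)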
Sparse autoencoders (SAEs) have attracted considerable attention as a promising tool for improving the interpretability of large language models (LLMs) by mapping the complex superposition …
B Cywiński, K Deja - arXiv preprint arXiv:2501.18052, 2025 - arxiv.org
Recent machine unlearning approaches offer a promising solution for removing unwanted concepts from diffusion models. However, traditional methods, which largely rely on fine …
Elucidating the rationale behind neural models' outputs has been a long-standing challenge in machine learning, and it is all the more pressing in this age of large language models …