Despite the rapid adoption and deployment of large language models (LLMs), their internal computations remain opaque and poorly understood. In this work, we seek …
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories …
L. Bereska and E. Gavves, arXiv preprint arXiv:2404.14082, 2024.
Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse-engineering the computational …
F. Zhang and N. Nanda, arXiv preprint arXiv:2309.16042, 2023.
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization (identifying the important model components) is a key …
Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While …
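The localization and causality-based methods mentioned in the two snippets above are typically carried out by intervening on internal activations and measuring how the output changes. The following is a minimal sketch of that idea, commonly known as activation patching, written against a toy PyTorch model; the network, inputs, and choice of layer are illustrative stand-ins and are not code from any of the papers excerpted here.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a stack of transformer components.
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1),
)

clean_input = torch.randn(1, 8)    # "clean" run
corrupt_input = torch.randn(1, 8)  # "corrupted" run

# 1. Cache the clean activation at the component we want to localize.
cached = {}
def cache_hook(module, inp, out):
    cached["act"] = out.detach()

layer = model[2]  # illustrative choice of component
handle = layer.register_forward_hook(cache_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the clean activation.
def patch_hook(module, inp, out):
    return cached["act"]  # returning a value replaces this layer's output

handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

corrupt_out = model(corrupt_input)

# 3. If patching this component moves the output back toward the clean run,
#    the component is "important" for the behavior under study.
print("clean:", clean_out.item(),
      "corrupt:", corrupt_out.item(),
      "patched:", patched_out.item())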
A common method for studying deep learning systems is to use simplified model representations, for example using singular value decomposition to visualize the model's …
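To make the simplified-representation idea above concrete, here is a short sketch of inspecting a weight matrix through its singular value decomposition; the matrix is random and purely illustrative, not taken from any model discussed in these excerpts.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for a learned weight matrix

U, S, Vt = np.linalg.svd(W)

# A rank-k approximation keeps only the largest singular directions,
# giving a simplified view of what the layer computes.
k = 8
W_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print("top singular values:", np.round(S[:k], 2))
print("rank-%d reconstruction error: %.3f"
      % (k, np.linalg.norm(W - W_k) / np.linalg.norm(W)))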
In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads …
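As a rough illustration of the incrementing behavior described in the snippet above, the toy sketch below fits a linear map that sends each ordered token's vector to its successor's vector and then checks that property; the vocabulary, embeddings, and fitted map are invented for illustration and are not the attention heads reported in the paper.

import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
rng = np.random.default_rng(0)
E = rng.normal(size=(len(days), 16))  # toy token embeddings (one row per day)

# Fit a map W so that E[i] @ W is approximately E[i+1] (wrapping at the end),
# i.e. a linear "successor" on this ordering.
targets = np.roll(E, -1, axis=0)
W, *_ = np.linalg.lstsq(E, targets, rcond=None)

# Evaluate: for each day, does the mapped vector land nearest its successor?
preds = E @ W
for i, day in enumerate(days):
    sims = preds[i] @ E.T
    print(day, "->", days[int(np.argmax(sims))])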
Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism …
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations …