Mechanistic Interpretability for AI Safety -- A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Beyond A*: Better planning with transformers via search dynamics bootstrapping

L Lehnert, S Sukhbaatar, DJ Su, Q Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
While Transformers have enabled tremendous progress in various application settings, such
architectures still trail behind traditional symbolic planners for solving complex decision …

On logical extrapolation for mazes with recurrent and implicit networks

B Knutson, AC Rabeendran, M Ivanitskiy… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has suggested that certain neural network architectures, particularly recurrent
neural networks (RNNs) and implicit neural networks (INNs), are capable of logical …

Transformers Can Navigate Mazes With Multi-Step Prediction

N Nolte, O Kitouni, A Williams, M Rabbat… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their remarkable success in language modeling, transformers trained to predict the
next token in a sequence struggle with long-term planning. This limitation is particularly …

Linearly Structured World Representations in Maze-Solving Transformers

M Ivanitskiy, AF Spies, T Räuker… - … of UniReps: the …, 2024 - proceedings.mlr.press
The emergence of seemingly similar representations across tasks and neural architectures
suggests that convergent properties may underlie sophisticated behavior. One form of …

Planning behavior in a recurrent neural network that plays Sokoban

A Garriga-Alonso, M Taufeeque… - ICML 2024 Workshop on …, 2024 - openreview.net
To predict how advanced neural networks generalize to novel situations, it is essential to
understand how they reason. Guez et al. (2019, "An investigation of model-free planning") …