Mechanistic Interpretability for AI Safety - A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Interpreting grokked transformers in complex modular arithmetic

H Furuta, G Minegishi, Y Iwasawa, Y Matsuo - arXiv preprint arXiv:2402.16726, 2024 - arxiv.org
Grokking has been actively explored to unravel the mystery of delayed generalization.
Identifying interpretable algorithms inside grokked models offers a suggestive hint to …

Hypothesis Testing the Circuit Hypothesis in LLMs

C Shi, N Beltran-Velez, A Nazaret, C Zheng… - ICML 2024 Workshop on … - openreview.net
Large language models (LLMs) demonstrate surprising capabilities, but we do not yet
understand how they are implemented. One hypothesis suggests that these capabilities are …