Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …
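
A minimal sketch of the k-sparse probing setup this abstract alludes to, on synthetic data: the activation matrix, labels, and the `informative` neuron indices are all hypothetical stand-ins for real LLM activations, and ranking neurons by class-mean difference is just one simple selection heuristic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "activations": (examples, neurons) with a binary feature label.
# A few neurons are made informative by construction (hypothetical stand-ins
# for feature-carrying neurons in a real LLM).
rng = np.random.default_rng(0)
n, d, k = 2000, 512, 4
y = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
informative = [3, 97, 200]
acts[:, informative] += 2.0 * y[:, None]

# 1. Rank neurons by a cheap univariate score (difference of class means).
scores = np.abs(acts[y == 1].mean(0) - acts[y == 0].mean(0))
top_k = np.argsort(scores)[-k:]

# 2. Fit a probe restricted to those k neurons only.
probe = LogisticRegression(max_iter=1000).fit(acts[:, top_k], y)
print("selected neurons:", sorted(top_k.tolist()))
print("k-sparse probe accuracy:", probe.score(acts[:, top_k], y))
```

The accuracy printed here is on the training set, which suffices for a sketch; a real evaluation would hold out data and sweep k to see how few neurons carry the feature.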

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Mechanistic Interpretability for AI Safety – A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Towards best practices of activation patching in language models: Metrics and methods

F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine
learning models, where localization (identifying the important model components) is a key …
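
A minimal sketch of the patching mechanic itself, assuming a toy MLP in place of a transformer component; the clean/corrupt inputs and the recovery metric are illustrative stand-ins for the prompt pairs and evaluation metrics the paper compares.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

clean_x = torch.randn(1, 8)    # stand-in for the "clean" prompt
corrupt_x = torch.randn(1, 8)  # stand-in for the "corrupted" prompt

# 1. Cache the clean activation at the site of interest (layer 0 output here).
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach().clone()

h = model[0].register_forward_hook(save_hook)
clean_out = model(clean_x)
h.remove()

# 2. Re-run on the corrupted input, overwriting that site with the clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor from a forward hook replaces the output

h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
h.remove()

corrupt_out = model(corrupt_x)

# 3. A patching metric: fraction of the clean-vs-corrupt gap the patch recovers.
recovered = (patched_out - corrupt_out).norm() / (clean_out - corrupt_out).norm()
print(f"fraction of gap recovered: {recovered.item():.3f}")
```

Because this toy model has a single upstream site, patching it recovers the clean output exactly; in a real transformer each head or MLP typically recovers only part of the gap, which is what the metric localizes.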

Towards vision-language mechanistic interpretability: A causal tracing tool for blip

V Palit, R Pandey, A Arora… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Mechanistic interpretability seeks to understand the neural mechanisms that enable specific
behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While …

Interpretability illusions in the generalization of simplified models

D Friedman, AK Lampinen, L Dixon… - … on Machine Learning, 2023 - openreview.net
A common method to study deep learning systems is to use simplified model
representations—for example, using singular value decomposition to visualize the model's …
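
A minimal sketch of the kind of simplification under study, with a random matrix standing in for trained weights: truncate a weight matrix to rank r via SVD and ask how closely the simplified map agrees with the original on some inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for a trained weight matrix

# Rank-r truncated SVD: keep the top-r singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_r = (U[:, :r] * S[:r]) @ Vt[:r]

# Compare the original and simplified maps on stand-in inputs.
x = rng.normal(size=(16, 64))
agreement = np.mean(np.sign(x @ W) == np.sign(x @ W_r))
print(f"rank-{r} reconstruction error: {np.linalg.norm(W - W_r) / np.linalg.norm(W):.3f}")
print(f"sign agreement on these inputs: {agreement:.3f}")
```

The paper's caution is that agreement measured on one input distribution can break down on another, so a simplified model that looks faithful in-distribution may still mislead about the full model's mechanism.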

Successor heads: Recurring, interpretable attention heads in the wild

R Gould, E Ong, G Ogden, A Conmy - arXiv preprint arXiv:2312.09230, 2023 - arxiv.org
In this work we present successor heads: attention heads that increment tokens with a
natural ordering, such as numbers, months, and days. For example, successor heads …
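
A minimal sketch of one way to test for successor behavior, with hypothetical random matrices standing in for a trained model's embedding (W_E), a single head's OV circuit (W_OV), and unembedding (W_U); the paper's actual analysis uses trained weights and accounts for more of the surrounding computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 32
W_E = rng.normal(size=(vocab, d_model))    # hypothetical embedding matrix
W_OV = rng.normal(size=(d_model, d_model)) # hypothetical per-head OV circuit
W_U = rng.normal(size=(d_model, vocab))    # hypothetical unembedding matrix

# Token ids forming a natural ordering (e.g. "1".."7" or Monday..Sunday).
ordered = list(range(10, 17))

# Route each ordered token's embedding through the head's OV circuit and
# unembed: rows are logit contributions over the vocabulary.
scores = W_E[ordered] @ W_OV @ W_U

# A head is "successor-like" if, restricted to the ordered list, the top
# contribution for each token is its successor.
hits = sum(
    scores[i, ordered].argmax() == i + 1
    for i in range(len(ordered) - 1)
)
print(f"succession accuracy within the ordered list: {hits}/{len(ordered) - 1}")
```

With random matrices this sits near chance; the finding in the paper is that specific heads in trained models score far above chance across many ordinal domains.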

Circuit component reuse across tasks in transformer language models

J Merullo, C Eickhoff, E Pavlick - arXiv preprint arXiv:2310.08744, 2023 - arxiv.org
Recent work in mechanistic interpretability has shown that behaviors in language models
can be successfully reverse-engineered through circuit analysis. A common criticism …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …