Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing …
Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors oftransformer models. This paper systematizes the mechanistic …
Abstract Concept erasure aims to remove specified features from a representation. It can improve fairness (eg preventing a classifier from using gender or race) and interpretability …
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 …
\emph {Circuit analysis} is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state …
Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stake domains …
Interpretable machine learning has exploded as an area of interest over the last decade, sparked by the rise of increasingly large datasets and deep neural networks …
F Zhang, N Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization--identifying the important model components--is a key …
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard …