The trustworthiness of machine learning has emerged as a critical topic in the field, encompassing various applications and research areas such as robustness, security …
Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g., preventing a classifier from using gender or race) and interpretability …
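To make the erasure idea concrete, here is a minimal sketch of its simplest linear form: removing the direction along which a binary concept is encoded, via orthogonal projection. The toy data, dimensions, and mean-difference direction are illustrative assumptions, not the method from the snippet, which offers stronger guarantees against linear recovery.

```python
# Minimal sketch of linear concept erasure by orthogonal projection.
# Toy setup: X holds representations, z is a binary protected concept
# (e.g., gender). We remove the direction separating the two concept
# groups -- a simple baseline for the erasure idea described above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 32
z = rng.integers(0, 2, size=n)                  # binary concept label
X = rng.normal(size=(n, d)) + 2.0 * np.outer(z, rng.normal(size=d))

# Direction along which the concept is linearly encoded:
# the difference between the concept-conditional means.
w = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
w /= np.linalg.norm(w)

# Project every representation onto the orthogonal complement of w.
X_erased = X - np.outer(X @ w, w)

# After erasure, the group means no longer differ along w.
print((X @ w)[z == 1].mean() - (X @ w)[z == 0].mean())                # large
print((X_erased @ w)[z == 1].mean() - (X_erased @ w)[z == 0].mean())  # ~0
```

Projecting out a single mean-difference direction is only a baseline; a linear probe may still recover the concept from other directions, which is the gap stronger erasure methods are designed to close.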
Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods …
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general …
T. Räuker, A. Ho, S. Casper, et al. - 2023 IEEE Conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However …
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories …
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent …
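As one concrete instance of the explainability efforts this snippet describes, the sketch below computes a gradient × input attribution over token embeddings. The tiny embedding-plus-linear model and random tokens are placeholder assumptions; a real system would apply the same recipe to a trained NLP model.

```python
# Minimal sketch of one common explainability approach for neural NLP:
# gradient x input attribution over token embeddings. The tiny model
# and random inputs below are placeholders for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, seq_len = 100, 16, 8

embed = nn.Embedding(vocab, dim)
classifier = nn.Linear(dim, 2)

tokens = torch.randint(0, vocab, (1, seq_len))
emb = embed(tokens).detach().requires_grad_(True)  # attribute w.r.t. embeddings

logits = classifier(emb.mean(dim=1))               # mean-pool, then classify
logits[0, 1].backward()                            # gradient of one class logit

# Per-token relevance: gradient x input, summed over embedding dimensions.
saliency = (emb.grad * emb).sum(dim=-1).squeeze(0)
print(saliency)  # one attribution score per input token
```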
F. Zhang, N. Nanda - arXiv preprint arXiv:2309.16042, 2023 - arxiv.org
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization--identifying the important model components--is a key …
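The localization step referred to here is commonly performed with activation patching, the subject of the cited arXiv:2309.16042. A minimal sketch follows, assuming a toy MLP in place of a language model and a simple logit-difference metric; both are illustrative stand-ins.

```python
# Minimal sketch of activation patching for localization: run a "clean"
# and a "corrupted" input, then splice the clean activation of one layer
# into the corrupted run and measure how much of the output it restores.
# The toy MLP stands in for a transformer component; illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                      nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 2))

clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)
cache = {}

def save_hook(module, inp, out):
    cache["mid"] = out.detach()        # cache the clean activation

def patch_hook(module, inp, out):
    return cache["mid"]                # overwrite with the cached activation

layer = model[2]                       # component whose importance we test

h = layer.register_forward_hook(save_hook)
clean_out = model(clean)
h.remove()

corrupted_out = model(corrupted)

h = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)         # corrupted run with one clean activation
h.remove()

# A component is "important" if patching it moves the corrupted output
# back toward the clean output (here measured with a simple logit diff).
metric = lambda out: (out[0, 0] - out[0, 1]).item()
print(metric(clean_out), metric(corrupted_out), metric(patched_out))
```

In PyTorch, a forward hook that returns a value replaces the module's output, which is what makes the splice possible without modifying the model itself.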
Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the …
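To illustrate the problem this snippet raises: a standard workaround is to construct an approximate counterfactual text by editing one concept while holding everything else fixed, then read the output difference as that concept's causal effect. The keyword-count scorer below is a hypothetical stand-in for a real neural model.

```python
# The "fundamental problem" above: for a factual text we observe the
# model's output, but never its output on the true counterfactual text.
# Common approximation: hand-edit one concept and treat the output
# difference as that concept's causal effect on the prediction.

def model_score(text: str) -> float:
    """Placeholder sentiment scorer (a real system would be a neural model)."""
    positive, negative = {"great", "friendly"}, {"awful", "rude"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

factual = "the food was great but the service was rude"
# Hand-constructed counterfactual: the service concept flipped, all else fixed.
counterfactual = "the food was great but the service was friendly"

effect = model_score(counterfactual) - model_score(factual)
print(f"estimated causal effect of the service concept: {effect}")
```

Because the edited text is only an approximation of the true counterfactual, such estimates inherit any confounds introduced by the edit itself, which is precisely the difficulty the snippet points to.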