Interpretability at scale: Identifying causal mechanisms in Alpaca

Z Wu, A Geiger, T Icard, C Potts… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …

Bridging causal discovery and large language models: A comprehensive survey of integrative approaches and future directions

G Wan, Y Wu, M Hu, Z Chu, S Li - arXiv preprint arXiv:2402.11068, 2024 - arxiv.org
Causal discovery (CD) and Large Language Models (LLMs) represent two emerging fields
of study with significant implications for artificial intelligence. Despite their distinct origins …

Bridging the human-AI knowledge gap: Concept discovery and transfer in AlphaZero

L Schut, N Tomasev, T McGrath, D Hassabis… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence (AI) systems have made remarkable progress, attaining super-human
performance across various domains. This presents us with an opportunity to further human …

Causal-structure driven augmentations for text OOD generalization

A Feder, Y Wald, C Shi, S Saria… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
The reliance of text classifiers on spurious correlations can lead to poor generalization at
deployment, raising concerns about their use in safety-critical domains such as healthcare …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Concept-based explainable artificial intelligence: A survey

E Poeta, G Ciravegna, E Pastor, T Cerquitelli… - arXiv preprint arXiv …, 2023 - arxiv.org
The field of explainable artificial intelligence emerged in response to the growing need for
more transparent and reliable models. However, using raw features to provide explanations …

Faithful explanations of black-box NLP models using LLM-generated counterfactuals

Y Gat, N Calderon, A Feder, A Chapanin… - arXiv preprint arXiv …, 2023 - arxiv.org
Causal explanations of the predictions of NLP systems are essential to ensure safety and
establish trust. Yet, existing methods often fall short of explaining model predictions …

ScoNe: Benchmarking negation reasoning in language models with fine-tuning and in-context learning

JS She, C Potts, SR Bowman, A Geiger - arXiv preprint arXiv:2305.19426, 2023 - arxiv.org
A number of recent benchmarks seek to assess how well models handle natural language
negation. However, these benchmarks lack the controlled example paradigms that would …

Mission: Impossible language models

J Kallini, I Papadimitriou, R Futrell, K Mahowald… - arXiv preprint arXiv …, 2024 - arxiv.org
Chomsky and others have very directly claimed that large language models (LLMs) are
equally capable of learning languages that are possible and impossible for humans to learn …

A glitch in the matrix? Locating and detecting language model grounding with Fakepedia

G Monea, M Peyrard, M Josifoski, V Chaudhary… - ACL, 2024 - hal.science
Large language models (LLMs) have an impressive ability to draw on novel information
supplied in their context. Yet the mechanisms underlying this contextual grounding remain …