Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models

P Hase, M Bansal, B Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Language models learn a great quantity of factual information during pretraining,
and recent work localizes this information to specific model weights like mid-layer MLP …

CRAFT: Concept recursive activation factorization for explainability

T Fel, A Picard, L Bethune, T Boissin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Attribution methods are a popular class of explainability methods that use heatmaps to
depict the most important areas of an image that drive a model decision. Nevertheless …

Concept embedding analysis: A review

G Schwalbe - arXiv preprint arXiv:2203.13909, 2022 - arxiv.org
Deep neural networks (DNNs) have found their way into many applications with potential
impact on the safety, security, and fairness of human-machine systems. Such applications require basic …

Interpretable image recognition by constructing transparent embedding space

J Wang, H Liu, X Wang, L Jing - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Humans usually explain their reasoning (e.g., classification) by dissecting the image and
pointing out the evidence from these parts to the concepts in their minds. Inspired by this …

Text-to-concept (and back) via cross-model alignment

M Moayeri, K Rezaei, M Sanjabi… - … on Machine Learning, 2023 - proceedings.mlr.press
We observe that the mapping from an image's representation in one model to its
representation in another can be learned surprisingly well with just a linear layer, even …

A holistic approach to unifying automatic concept extraction and concept importance estimation

T Fel, V Boutin, L Béthune, R Cadène… - Advances in …, 2024 - proceedings.neurips.cc
In recent years, concept-based approaches have emerged as some of the most promising
explainability methods to help us interpret the decisions of Artificial Neural Networks (ANNs) …

Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery

S Rao, S Mahajan, M Böhle, B Schiele - European Conference on …, 2024 - Springer
Concept Bottleneck Models (CBMs) have recently been proposed to address the
'black-box' problem of deep neural networks, by first mapping images to a human …

SoK: Explainable machine learning in adversarial environments

M Noppel, C Wressnegger - 2024 IEEE Symposium on Security …, 2024 - ieeexplore.ieee.org
Modern deep learning methods have long been considered black boxes due to the lack of
insights into their decision-making process. However, recent advances in explainable …

Scalable interpretability via polynomials

A Dubey, F Radenovic… - Advances in neural …, 2022 - proceedings.neurips.cc
Generalized Additive Models (GAMs) have quickly become the leading choice for
interpretable machine learning. However, unlike uninterpretable methods such as DNNs …

Disentangled explanations of neural network predictions by finding relevant subspaces

P Chormai, J Herrmann, KR Müller… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Explainable AI aims to overcome the black-box nature of complex ML models like neural
networks by generating explanations for their predictions. Explanations often take the form of …