A comprehensive study of knowledge editing for large language models

N Zhang, Y Yao, B Tian, P Wang, S Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models

N Li, Y Li, Y Liu, L Shi, K Wang, H Wang - Proceedings of the ACM on …, 2024 - dl.acm.org
Large language models (LLMs) have revolutionized language processing, but face critical
challenges with security, privacy, and generating hallucinations—coherent but factually …

Interpreting attention layer outputs with sparse autoencoders

C Kissane, R Krzyzanowski, JI Bloom, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Decomposing model activations into interpretable components is a key open problem in
mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for …

Can AI Assistants Know What They Don't Know?

Q Cheng, T Sun, X Liu, W Zhang, Z Yin, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, AI assistants based on large language models (LLMs) have shown surprising
performance in many tasks, such as dialogue, solving math problems, writing code, and …

Steering without side effects: Improving post-deployment control of language models

AC Stickland, A Lyzhov, J Pfau, S Mahdi… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) have been shown to behave unexpectedly post-deployment. For
example, new jailbreaks continually arise, allowing model misuse, despite extensive red …

Large language model supply chain: A research agenda

S Wang, Y Zhao, X Hou, H Wang - ACM Transactions on Software …, 2024 - dl.acm.org
The rapid advancement of large language models (LLMs) has revolutionized artificial
intelligence, introducing unprecedented capabilities in natural language processing and …

Improving steering vectors by targeting sparse autoencoder features

S Chalnev, M Siu, A Conmy - arXiv preprint arXiv:2411.02193, 2024 - arxiv.org
To control the behavior of language models, steering methods attempt to ensure that the
model's outputs satisfy specific pre-defined properties. Adding steering vectors to the model is a …

Open Problems in Mechanistic Interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey… - arXiv preprint arXiv …, 2025 - arxiv.org
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

J Chua, E Rees, H Batra, SR Bowman… - arXiv preprint arXiv …, 2024 - arxiv.org
While chain-of-thought prompting (CoT) has the potential to improve the explainability of
language model reasoning, it can systematically misrepresent the factors influencing …