The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in …
Large language models (LLMs) have revolutionized language processing, but face critical challenges with security, privacy, and generating hallucinations—coherent but factually …
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for …
Recently, AI assistants based on large language models (LLMs) show surprising performance in many tasks, such as dialogue, solving math problems, writing code, and …
Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red …
The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, introducing unprecedented capabilities in natural language processing and …
S Chalnev, M Siu, A Conmy - arXiv preprint arXiv:2411.02193, 2024 - arxiv.org
To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a …
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering …
While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing …