A comprehensive study of knowledge editing for large language models

N Zhang, Y Yao, B Tian, P Wang, S Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models

N Li, Y Li, Y Liu, L Shi, K Wang, H Wang - Proceedings of the ACM on …, 2024 - dl.acm.org
Large language models (LLMs) have revolutionized language processing, but face critical
challenges with security, privacy, and generating hallucinations—coherent but factually …

Interpreting attention layer outputs with sparse autoencoders

C Kissane, R Krzyzanowski, JI Bloom, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Decomposing model activations into interpretable components is a key open problem in
mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for …

Can AI Assistants Know What They Don't Know?

Q Cheng, T Sun, X Liu, W Zhang, Z Yin, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, AI assistants based on large language models (LLMs) have shown surprising
performance in many tasks, such as dialogue, solving math problems, writing code, and …

Steering without side effects: Improving post-deployment control of language models

AC Stickland, A Lyzhov, J Pfau, S Mahdi… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) have been shown to behave unexpectedly post-deployment. For
example, new jailbreaks continually arise, allowing model misuse, despite extensive red …

Large language model supply chain: A research agenda

S Wang, Y Zhao, X Hou, H Wang - ACM Transactions on Software …, 2024 - dl.acm.org
The rapid advancement of large language models (LLMs) has revolutionized artificial
intelligence, introducing unprecedented capabilities in natural language processing and …

Improving steering vectors by targeting sparse autoencoder features

S Chalnev, M Siu, A Conmy - arXiv preprint arXiv:2411.02193, 2024 - arxiv.org
To control the behavior of language models, steering methods attempt to ensure that the
model's outputs satisfy specific pre-defined properties. Adding steering vectors to the model is a …

Open Problems in Mechanistic Interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey… - arXiv preprint arXiv …, 2025 - arxiv.org
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

J Chua, E Rees, H Batra, SR Bowman… - arXiv preprint arXiv …, 2024 - arxiv.org
While chain-of-thought prompting (CoT) has the potential to improve the explainability of
language model reasoning, it can systematically misrepresent the factors influencing …