Are aligned neural networks adversarially aligned?

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

被引用次数：338 相关文章所有 3 个版本

[PDF] arxiv.org

Ai alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

被引用次数：136 相关文章所有 3 个版本

[PDF] thecvf.com

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

被引用次数：964 相关文章所有 5 个版本

[PDF] arxiv.org

Universal and transferable adversarial attacks on aligned language models

A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org

Because" out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …

被引用次数：662 相关文章所有 8 个版本

[PDF] arxiv.org

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

被引用次数：297 相关文章所有 6 个版本

[PDF] arxiv.org

Fine-tuning aligned language models compromises safety, even when users do not intend to!

X Qi, Y Zeng, T Xie, PY Chen, R Jia, P Mittal… - arXiv preprint arXiv …, 2023 - arxiv.org

Optimizing large language models (LLMs) for downstream use cases often involves the
customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama …

被引用次数：245 相关文章所有 4 个版本

[PDF] acm.org

Explainability for large language models: A survey

H Zhao, H Chen, F Yang, N Liu, H Deng, H Cai… - ACM Transactions on …, 2024 - dl.acm.org

Large language models (LLMs) have demonstrated impressive capabilities in natural
language processing. However, their internal mechanisms are still unclear and this lack of …

被引用次数：248 相关文章所有 5 个版本

[PDF] arxiv.org

Scalable extraction of training data from (production) language models

M Nasr, N Carlini, J Hayase, M Jagielski… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper studies extractable memorization: training data that an adversary can efficiently
extract by querying a machine learning model without prior knowledge of the training …

被引用次数：173 相关文章所有 8 个版本

[PDF] arxiv.org

Catastrophic jailbreak of open-source llms via exploiting generation

Y Huang, S Gupta, M Xia, K Li, D Chen - arXiv preprint arXiv:2310.06987, 2023 - arxiv.org

The rapid progress in open-source large language models (LLMs) is significantly advancing
AI development. Extensive efforts have been made before model release to align their …

被引用次数：144 相关文章所有 3 个版本

[PDF] nowpublishers.com

Identifying and mitigating the security risks of generative ai

C Barrett, B Boyd, E Bursztein, N Carlini… - … and Trends® in …, 2023 - nowpublishers.com

Every major technical invention resurfaces the dual-use dilemma—the new technology has
the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such …

被引用次数：58 相关文章所有 7 个版本

高级搜索

QQ 群