The unlocking spell on base LLMs: Rethinking alignment via in-context learning

BY Lin, A Ravichander, X Lu, N Dziri… - The Twelfth …, 2023 - openreview.net
Alignment tuning has become the de facto standard practice for enabling base large
language models (LLMs) to serve as open-domain AI assistants. The alignment tuning …

How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs

Y Zeng, H Lin, J Zhang, D Yang, R Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …

Low-resource languages jailbreak GPT-4

ZX Yong, C Menghini, SH Bach - arXiv preprint arXiv:2310.02446, 2023 - arxiv.org
AI safety training and red-teaming of large language models (LLMs) are measures to
mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual …

LLM self defense: By self examination, LLMs know they are being tricked

M Phute, A Helbling, M Hull, SY Peng, S Szyller… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are popular for high-quality text generation but can produce
harmful content, even when aligned with human values through reinforcement learning …

Shadow alignment: The ease of subverting safely-aligned language models

X Yang, X Wang, Q Zhang, L Petzold, WY Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: This paper contains examples of harmful language, and reader discretion is
recommended. The increasing open release of powerful large language models (LLMs) has …

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …

Defending large language models against jailbreaking attacks through goal prioritization

Z Zhang, J Yang, P Ke, M Huang - arXiv preprint arXiv:2311.09096, 2023 - arxiv.org
Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is
accompanied by a growing array of safety risks. While significant attention has been …

Self-prompting large language models for zero-shot open-domain QA

J Li, Z Zhang, H Zhao - arXiv preprint arXiv:2212.08635, 2022 - arxiv.org
Open-Domain Question Answering (ODQA) aims at answering factoid questions without
explicitly providing specific background documents. In a zero-shot setting, this task is more …

SafeDecoding: Defending against jailbreak attacks via safety-aware decoding

Z Xu, F Jiang, L Niu, J Jia, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into real-world
applications such as code generation and chatbot assistance, extensive efforts have been …