The unlocking spell on base LLMs: Rethinking alignment via in-context learning

BY Lin, A Ravichander, X Lu, N Dziri… - The Twelfth …, 2023 - openreview.net
Alignment tuning has become the de facto standard practice for enabling base large
language models (LLMs) to serve as open-domain AI assistants. The alignment tuning …

How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs

Y Zeng, H Lin, J Zhang, D Yang, R Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …

Low-resource languages jailbreak GPT-4

ZX Yong, C Menghini, SH Bach - arXiv preprint arXiv:2310.02446, 2023 - arxiv.org
AI safety training and red-teaming of large language models (LLMs) are measures to
mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual …

LLM self defense: By self examination, LLMs know they are being tricked

M Phute, A Helbling, M Hull, SY Peng, S Szyller… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are popular for high-quality text generation but can produce
harmful content, even when aligned with human values through reinforcement learning …

Shadow alignment: The ease of subverting safely-aligned language models

X Yang, X Wang, Q Zhang, L Petzold, WY Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: This paper contains examples of harmful language, and reader discretion is
recommended. The increasing open release of powerful large language models (LLMs) has …

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …

Defending large language models against jailbreaking attacks through goal prioritization

Z Zhang, J Yang, P Ke, M Huang - arXiv preprint arXiv:2311.09096, 2023 - arxiv.org
Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is
accompanied by a growing array of safety risks. While significant attention has been …

Self-prompting large language models for zero-shot open-domain QA

J Li, Z Zhang, H Zhao - arXiv preprint arXiv:2212.08635, 2022 - arxiv.org
Open-Domain Question Answering (ODQA) aims at answering factoid questions without
explicitly providing specific background documents. In a zero-shot setting, this task is more …

SafeDecoding: Defending against jailbreak attacks via safety-aware decoding

Z Xu, F Jiang, L Niu, J Jia, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into real-world
applications such as code generation and chatbot assistance, extensive efforts have been …