Jailbreak attacks and defenses against large language models: A survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, code completion, etc. However, the over …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

The instruction hierarchy: Training LLMs to prioritize privileged instructions

E Wallace, K Xiao, R Leike, L Weng, J Heidecke… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow
adversaries to overwrite a model's original instructions with their own malicious prompts. In …

Compromising embodied agents with contextual backdoor attacks

A Liu, Y Zhou, X Liu, T Zhang, S Liang, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have transformed the development of embodied
intelligence. By providing a few contextual demonstrations, developers can utilize the …

Capabilities of large language models in control engineering: A benchmark study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

D Kevian, U Syed, X Guo, A Havens, G Dullerud… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we explore the capabilities of state-of-the-art large language models (LLMs)
such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control …

GlitchProber: Advancing effective detection and mitigation of glitch tokens in large language models

Z Zhang, W Bai, Y Li, MH Meng, K Wang, L Shi… - Proceedings of the 39th …, 2024 - dl.acm.org
Large language models (LLMs) have achieved unprecedented success in the field of natural
language processing. However, the black-box nature of their internal mechanisms has …

Mission impossible: A statistical perspective on jailbreaking LLMs

J Su, J Kempe, K Ullrich - arXiv preprint arXiv:2408.01420, 2024 - arxiv.org
Large language models (LLMs) are trained on a deluge of text data with limited quality
control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as …

Rethinking LLM memorization through the lens of adversarial compression

A Schwarzschild, Z Feng, P Maini, ZC Lipton… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) trained on web-scale datasets raise substantial concerns
regarding permissible data usage. One major question is whether these models "memorize" …

A Realistic Threat Model for Large Language Model Jailbreaks

V Boreiko, A Panfilov, V Voracek, M Hein… - arXiv preprint arXiv …, 2024 - arxiv.org
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from
safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing …