Jailbreaking and mitigation of vulnerabilities in large language models

B Peng, Z Bi, Q Niu, M Liu, P Feng, T Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have transformed artificial intelligence by advancing
natural language understanding and generation, enabling applications across fields beyond …

Defending jailbreak prompts via in-context adversarial game

Y Zhou, Y Han, H Zhuang, K Guo, Z Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse
applications. However, concerns regarding their security, particularly the vulnerability to …

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Q Ren, H Li, D Liu, Z Xie, X Lu, Y Qiao, L Sha… - arXiv preprint arXiv …, 2024 - arxiv.org
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn
interactions, where malicious users can obscure harmful intents across several queries. We …

Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions

Q Liu, Z Zhou, L He, Y Liu, W Zhang… - Proceedings of the 2024 …, 2024 - aclanthology.org
Large language models are susceptible to jailbreak attacks, which can result in the
generation of harmful content. While prior defenses mitigate these risks by perturbing or …

Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models

H Yang, L Qu, E Shareghi, G Haffari - arXiv preprint arXiv:2410.11459, 2024 - arxiv.org
Large language models (LLMs) have exhibited outstanding performance in engaging with
humans and addressing complex questions by leveraging their vast implicit knowledge and …

Smoothed Embeddings for Robust Language Models

R Hase, MRU Rashid, A Lewis, J Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
Improving the safety and reliability of large language models (LLMs) is a crucial aspect of
realizing trustworthy AI systems. Although alignment methods aim to suppress harmful …

No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks

CT Leong, Y Cheng, K Xu, J Wang, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The existing safety alignment of Large Language Models (LLMs) has been found to be fragile and can be easily compromised through different strategies, such as fine-tuning on a few harmful …

Nested Gloss Makes Large Language Models Lost

X Li, Z Zhou, J Zhu, J Yao, T Liu, B Han - openreview.net
Large language models (LLMs) have achieved significant success in various applications but remain susceptible to adversarial jailbreaks that bypass their safety guardrails. Previous …

Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

G Zizzo, G Cornacchia, K Fraser, MZ Hameed… - NeurIPS Safe Generative … - openreview.net
As large language models (LLMs) become more integrated into everyday applications,
ensuring their robustness and security is increasingly critical. In particular, LLMs can be …