Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

L Gao, X Zhang, P Nakov, X Chen - arXiv preprint arXiv:2412.17034, 2024 - arxiv.org
Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive
LLMs into generating harmful text. Yet, there is still insufficient understanding of how …

Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

J Fonseca, A Bell, J Stoyanovich - arXiv preprint arXiv:2501.02018, 2025 - arxiv.org
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or
adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been …

Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

X Yi, Y Li, L Wang, X Wang, L He - arXiv preprint arXiv:2501.10639, 2025 - arxiv.org
Ensuring safety alignment has become a critical requirement for large language models
(LLMs), particularly given their widespread deployment in real-world applications. However …