On large language models' resilience to coercive interrogation

Z Zhang, G Shen, G Tao, S Cheng… - 2024 IEEE Symposium on …, 2024 - computer.org
Abstract Large Language Models (LLMs) are increasingly employed in numerous
applications. It is hence important to ensure that their ethical standard aligns with humans' …

[PDF][PDF] BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target

G Shen, S Cheng, Z Zhang, G Tao, K Zhang… - 2025 IEEE Symposium …, 2024 - cs.purdue.edu
Recent literature has shown that LLMs are vulnerable to backdoor attacks, where malicious
attackers inject a secret token sequence (i.e., a trigger) into training prompts and enforce their …

Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

W Jiang, Z Wang, J Zhai, S Ma, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite prior safety alignment efforts, mainstream LLMs can still generate harmful and
unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall …

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

H Wang, H Li, J Zhu, X Wang, C Pan, ML Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are susceptible to generating harmful content when
prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs …

ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs

L Yan, S Cheng, X Chen, K Zhang, G Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have become integral to many applications, with system
prompts serving as a key mechanism to regulate model behavior and ensure ethical outputs …

RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

X Chen, Y Nie, L Yan, Y Mao, W Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern large language model (LLM) developers typically conduct safety alignment to
prevent an LLM from generating unethical or harmful content. Recent studies have …

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

X Chen, Y Nie, W Guo, X Zhang - arXiv preprint arXiv:2406.08705, 2024 - arxiv.org
Recent studies developed jailbreaking attacks, which construct jailbreaking prompts
to "fool" LLMs into responding to harmful questions. Early-stage jailbreaking attacks require …

SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization

H Guo, S Cheng, G Tao, G Shen, Z Zhang, S An… - Red Teaming GenAI … - openreview.net
Large Language Models (LLMs) have become increasingly impactful across various
domains, including coding and data analysis. However, their widespread adoption has …

Understanding and Enhancing the Transferability of Jailbreaking Attacks

openreview.net
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs)
to produce harmful responses. Nevertheless, these attacks exhibit limited transferability …