Survey of vulnerabilities in large language models revealed by adversarial attacks

E Shayegani, MAA Mamun, Y Fu, P Zaree… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …

A survey of attacks on large vision-language models: Resources, advances, and future trends

D Liu, M Yang, X Qu, P Zhou, W Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the significant development of large models in recent years, Large Vision-Language
Models (LVLMs) have demonstrated remarkable capabilities across a wide range of …

Privacy in large language models: Attacks, defenses and future directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
The advancement of large language models (LLMs) has significantly enhanced the ability to
effectively tackle various downstream NLP tasks and unify these tasks into generative …

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Red-teaming for generative AI: Silver bullet or security theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Image hijacks: Adversarial images can control generative models at runtime

L Bailey, E Ong, S Russell, S Emmons - arXiv preprint arXiv:2309.00236, 2023 - arxiv.org
Are foundation models secure from malicious actors? In this work, we focus on the image
input to a vision-language model (VLM). We discover image hijacks, adversarial images that …

AutoDAN: Automatic and interpretable adversarial attacks on large language models

S Zhu, R Zhang, B An, G Wu, J Barrow, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Safety alignment of Large Language Models (LLMs) can be compromised with manual
jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching …

Ignore this title and HackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition

S Schulhoff, J Pinto, A Khan, LF Bouchard… - Proceedings of the …, 2023 - aclanthology.org
Large Language Models (LLMs) are increasingly being deployed in interactive
contexts that involve direct user engagement, such as chatbots and writing assistants. These …

Targeted latent adversarial training improves robustness to persistent harmful behaviors in LLMs

A Sheshadri, A Ewart, P Guo, A Lynch, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) can often be made to behave in undesirable ways that they
are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a …

JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …