AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Survey of vulnerabilities in large language models revealed by adversarial attacks

E Shayegani, MAA Mamun, Y Fu, P Zaree… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …

Universal and transferable adversarial attacks on aligned language models

A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org
Because" out-of-the-box" large language models are capable of generating a great deal of
objectionable content, recent work has focused on aligning these models in an attempt to …

Jailbroken: How does LLM safety training fail?

A Wei, N Haghtalab… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models trained for safety and harmlessness remain susceptible to
adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Catastrophic jailbreak of open-source LLMs via exploiting generation

Y Huang, S Gupta, M Xia, K Li, D Chen - arXiv preprint arXiv:2310.06987, 2023 - arxiv.org
The rapid progress in open-source large language models (LLMs) is significantly advancing
AI development. Extensive efforts have been made before model release to align their …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Visual adversarial examples jailbreak aligned large language models

X Qi, K Huang, A Panda, P Henderson… - Proceedings of the …, 2024 - ojs.aaai.org
Warning: this paper contains data, prompts, and model outputs that are offensive in nature.
Recently, there has been a surge of interest in integrating vision into Large Language …

GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher

Y Yuan, W Jiao, W Wang, J Huang, P He, S Shi… - arXiv preprint arXiv …, 2023 - arxiv.org
Safety lies at the core of the development of Large Language Models (LLMs). There is
ample work on aligning LLMs with human ethics and preferences, including data filtering in …

How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs

Y Zeng, H Lin, J Zhang, D Yang, R Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …