Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

E Wallace, K Xiao, R Leike, L Weng, J Heidecke… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow
adversaries to overwrite a model's original instructions with their own malicious prompts. In …

Are you still on track!? Catching LLM Task Drift with Activations

S Abdelnabi, A Fay, G Cherubin, A Salem… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to
orchestrate tasks and process inputs from users and other sources. These inputs, even in a …

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

E Debenedetti, J Zhang, M Balunović… - arXiv preprint arXiv …, 2024 - arxiv.org
AI agents aim to solve complex tasks by combining text-based reasoning with external tool
calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned …

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user
requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts …