Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

E Wallace, K Xiao, R Leike, L Weng, J Heidecke… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow
adversaries to overwrite a model's original instructions with their own malicious prompts. In …

Are you still on track!? Catching LLM Task Drift with Activations

S Abdelnabi, A Fay, G Cherubin, A Salem… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to
orchestrate tasks and process inputs from users and other sources. These inputs, even in a …

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

E Debenedetti, J Zhang, M Balunović… - arXiv preprint arXiv …, 2024 - arxiv.org
AI agents aim to solve complex tasks by combining text-based reasoning with external tool
calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned …

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user
requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts …