Threats, attacks, and defenses in machine unlearning: A survey

Z Liu, H Ye, C Chen, Y Zheng, KY Lam - arXiv preprint arXiv:2403.13682, 2024 - arxiv.org
Machine Unlearning (MU) has recently gained considerable attention due to its potential to
achieve Safe AI by removing the influence of specific data from trained Machine Learning …

Improving alignment and robustness with circuit breakers

A Zou, L Phan, J Wang, D Duenas, M Lin… - The Thirty-eighth …, 2024 - openreview.net
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We
present an approach, inspired by recent advances in representation engineering, that …

An adversarial perspective on machine unlearning for AI safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Trustworthy, responsible, and safe AI: A comprehensive architectural framework for AI safety with challenges and mitigations

C Chen, Z Liu, W Jiang, SQ Goh, KKY Lam - arXiv preprint arXiv …, 2024 - arxiv.org
AI Safety is an emerging area of critical importance to the safe adoption and deployment of
AI systems. With the rapid proliferation of AI and especially with the recent advancement of …

Efficient adversarial training in LLMs with continuous attacks

S Xhonneux, A Sordoni, S Günnemann, G Gidel… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their
safety guardrails. In many domains, adversarial training has proven to be one of the most …

Soft prompts go hard: Steering visual language models with hidden meta-instructions

T Zhang, C Zhang, JX Morris, E Bagdasarian… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a new type of indirect injection attack against language models that operate
on images: hidden "meta-instructions" that influence how the model interprets the image and …

Multilingual blending: LLM safety alignment evaluation with language mixture

J Song, Y Huang, Z Zhou, L Ma - arXiv preprint arXiv:2407.07342, 2024 - arxiv.org
As safety remains a crucial concern throughout the development lifecycle of Large
Language Models (LLMs), researchers and industrial practitioners have increasingly …

Uncovering safety risks of large language models through concept activation vector

Z Xu, R Huang, C Chen, X Wang - The Thirty-eighth Annual …, 2024 - openreview.net
Despite careful safety alignment, current large language models (LLMs) remain vulnerable
to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept …

A probabilistic perspective on unlearning and alignment for large language models

Y Scholten, S Günnemann, L Schwinn - arXiv preprint arXiv:2410.03523, 2024 - arxiv.org
Comprehensive evaluation of Large Language Models (LLMs) is an open research problem.
Existing evaluations rely on deterministic point estimates generated via greedy decoding …

Evaluating the cybersecurity robustness of commercial LLMs against adversarial prompts: A PromptBench analysis

T Goto, K Ono, A Morita - Authorea Preprints, 2024 - techrxiv.org
This study presents a comprehensive evaluation of the cybersecurity robustness of five
leading Large Language Models (LLMs): ChatGPT-4, Google Gemini, Anthropic Claude …