Threats, attacks, and defenses in machine unlearning: A survey

Z Liu, H Ye, C Chen, Y Zheng, KY Lam - arXiv preprint arXiv:2403.13682, 2024 - arxiv.org
Machine Unlearning (MU) has recently gained considerable attention due to its potential to
achieve Safe AI by removing the influence of specific data from trained Machine Learning …

Improving alignment and robustness with circuit breakers

A Zou, L Phan, J Wang, D Duenas, M Lin… - The Thirty-eighth …, 2024 - openreview.net
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We
present an approach, inspired by recent advances in representation engineering, that …

An adversarial perspective on machine unlearning for AI safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Trustworthy, responsible, and safe AI: A comprehensive architectural framework for AI safety with challenges and mitigations

C Chen, Z Liu, W Jiang, SQ Goh, KKY Lam - arXiv preprint arXiv …, 2024 - arxiv.org
AI Safety is an emerging area of critical importance to the safe adoption and deployment of
AI systems. With the rapid proliferation of AI and especially with the recent advancement of …

Efficient adversarial training in LLMs with continuous attacks

S Xhonneux, A Sordoni, S Günnemann, G Gidel… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their
safety guardrails. In many domains, adversarial training has proven to be one of the most …

Soft prompts go hard: Steering visual language models with hidden meta-instructions

T Zhang, C Zhang, JX Morris, E Bagdasarian… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a new type of indirect injection attack against language models that operate
on images: hidden "meta-instructions" that influence how the model interprets the image and …

Multilingual blending: LLM safety alignment evaluation with language mixture

J Song, Y Huang, Z Zhou, L Ma - arXiv preprint arXiv:2407.07342, 2024 - arxiv.org
As safety remains a crucial concern throughout the development lifecycle of Large
Language Models (LLMs), researchers and industrial practitioners have increasingly …

Uncovering safety risks of large language models through concept activation vector

Z Xu, R Huang, C Chen, X Wang - The Thirty-eighth Annual …, 2024 - openreview.net
Despite careful safety alignment, current large language models (LLMs) remain vulnerable
to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept …

A probabilistic perspective on unlearning and alignment for large language models

Y Scholten, S Günnemann, L Schwinn - arXiv preprint arXiv:2410.03523, 2024 - arxiv.org
Comprehensive evaluation of Large Language Models (LLMs) is an open research problem.
Existing evaluations rely on deterministic point estimates generated via greedy decoding …

Evaluating the cybersecurity robustness of commercial LLMs against adversarial prompts: A PromptBench analysis

T Goto, K Ono, A Morita - Authorea Preprints, 2024 - techrxiv.org
This study presents a comprehensive evaluation of the cybersecurity robustness of five
leading Large Language Models (LLMs): ChatGPT-4, Google Gemini, Anthropic Claude …