AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that …
Large language models are fine-tuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove …
C Chen, Z Liu, W Jiang, SQ Goh, KKY Lam - arXiv preprint arXiv …, 2024 - arxiv.org
AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of …
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most …
We introduce a new type of indirect injection attacks against language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and …
As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly …
Z Xu, R Huang, C Chen, X Wang - The Thirty-eighth Annual …, 2024 - openreview.net
Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept …
Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding …
T Goto, K Ono, A Morita - Authorea Preprints, 2024 - techrxiv.org
This study presents a comprehensive evaluation of the cybersecurity robustness of five leading Large Language Models (LLMs): ChatGPT-4, Google Gemini, Anthropic Claude …