The WMDP benchmark: Measuring and reducing malicious use with unlearning

N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti… - arXiv preprint arXiv …, 2024 - arxiv.org
The White House Executive Order on Artificial Intelligence highlights the risks of large
language models (LLMs) empowering malicious actors in developing biological, cyber, and …

Managing AI risks in an era of rapid progress

Y Bengio, G Hinton, A Yao, D Song… - arXiv preprint arXiv …, 2023 - blog.biocomm.ai
In this short consensus paper, we outline risks from upcoming, advanced AI systems. We
examine large-scale social harms and malicious uses, as well as an irreversible loss of …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …

Managing extreme AI risks amid rapid progress

Y Bengio, G Hinton, A Yao, D Song, P Abbeel, T Darrell… - Science, 2024 - science.org
Artificial intelligence (AI) is progressing rapidly, and companies are shifting their focus to
developing generalist AI systems that can autonomously act and pursue goals. Increases in …

Protecting society from AI misuse: when are restrictions on capabilities warranted?

M Anderljung, J Hazell, M von Knebel - AI & SOCIETY, 2024 - Springer
Artificial intelligence (AI) systems will increasingly be used to cause harm as they grow more
capable. In fact, AI systems are already starting to help automate fraudulent activities, violate …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B

S Lermen, C Rogers-Smith, J Ladish - arXiv preprint arXiv:2310.20624, 2023 - arxiv.org
AI developers often apply safety alignment procedures to prevent the misuse of their AI
systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine …

Connecting the dots: LLMs can infer and verbalize latent structure from disparate training data

J Treutlein, D Choi, J Betley, S Marks, C Anil… - arXiv preprint arXiv …, 2024 - arxiv.org
One way to address safety risks from large language models (LLMs) is to censor dangerous
knowledge from their training data. While this removes the explicit information, implicit …

Acceptable Use Policies for Foundation Models

K Klyman - Proceedings of the AAAI/ACM Conference on AI, Ethics …, 2024 - ojs.aaai.org
As foundation models have accumulated hundreds of millions of users, developers have
begun to take steps to prevent harmful types of uses. One salient intervention that foundation …

On the limitations of compute thresholds as a governance strategy

S Hooker - arXiv preprint arXiv:2407.05694, 2024 - arxiv.org
At face value, this essay is about understanding a fairly esoteric governance tool called
compute thresholds. However, in order to grapple with whether these thresholds will achieve …