Defining and characterizing reward gaming

J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of reward hacking, a phenomenon where
optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor …

Foundation models and fair use

P Henderson, X Li, D Jurafsky, T Hashimoto… - Journal of Machine …, 2023 - jmlr.org
Existing foundation models are trained on copyrighted material. Deploying these models
can pose both legal and ethical risks when data creators fail to receive appropriate …

A general language assistant as a laboratory for alignment

A Askell, Y Bai, A Chen, D Drain, D Ganguli… - arXiv preprint arXiv …, 2021 - arxiv.org
Given the broad capabilities of large language models, it should be possible to work towards
a general-purpose, text-based assistant that is aligned with human values, meaning that it is …

Towards measuring the representation of subjective global opinions in language models

E Durmus, K Nguyen, TI Liao, N Schiefer… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) may not equitably represent diverse global perspectives on
societal issues. In this paper, we develop a quantitative framework to evaluate whose …

Unsolved problems in ml safety

D Hendrycks, N Carlini, J Schulman… - arXiv preprint arXiv …, 2021 - arxiv.org
Machine learning (ML) systems are rapidly increasing in size, are acquiring new
capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …

Identifying and mitigating the security risks of generative ai

C Barrett, B Boyd, E Bursztein, N Carlini… - … and Trends® in …, 2023 - nowpublishers.com
Every major technical invention resurfaces the dual-use dilemma—the new technology has
the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such …

Safe rlhf: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Smoothllm: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human values, widely-used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …

The alignment problem from a deep learning perspective

R Ngo, L Chan, S Mindermann - arXiv preprint arXiv:2209.00626, 2022 - arxiv.org
In coming decades, artificial general intelligence (AGI) may surpass human capabilities at
many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to …

A taxonomy of prompt modifiers for text-to-image generation

J Oppenlaender - Behaviour & Information Technology, 2023 - Taylor & Francis
Text-guided synthesis of images has become enormously popular and online communities
dedicated to text-to-image generation and art generated with Artificial Intelligence (AI) have …