Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Defining and characterizing reward gaming

J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of \textbf{reward hacking}, a phenomenon where
optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor …
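As a rough sketch of the flavor of definition involved (my paraphrase, not the paper's exact statement): a proxy $\tilde{\mathcal{R}}$ is hackable relative to the true reward $\mathcal{R}$ if some pair of policies $\pi, \pi'$ satisfies

$$ J_{\tilde{\mathcal{R}}}(\pi') > J_{\tilde{\mathcal{R}}}(\pi) \quad \text{and} \quad J_{\mathcal{R}}(\pi') < J_{\mathcal{R}}(\pi), $$

where $J_{R}(\pi)$ denotes the expected return of policy $\pi$ under reward function $R$: improving the proxy strictly worsens true performance.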

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

T Everitt, M Hutter, R Kumar, V Krakovna - Synthese, 2021 - Springer
Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding?
Or will sufficiently capable RL agents always find ways to bypass their intended objectives …

TASRA: a taxonomy and analysis of societal-scale risks from AI

A Critch, S Russell - arXiv preprint arXiv:2306.06924, 2023 - arxiv.org
While several recent works have identified societal-scale and extinction-level risks to
humanity arising from artificial intelligence, few have attempted an exhaustive …

Optimal policies tend to seek power

AM Turner, L Smith, R Shah, A Critch… - arXiv preprint arXiv …, 2019 - arxiv.org
Some researchers speculate that intelligent reinforcement learning (RL) agents would be
incentivized to seek resources and power in pursuit of their objectives. Other researchers …
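The power-seeking claim can be made concrete in a small tabular MDP: score each state by its optimal value averaged over randomly sampled reward functions, and states that keep more options open tend to score higher. A minimal sketch of that computation (the toy MDP and all variable names are hypothetical, not the paper's setup):

import numpy as np

# Hedged sketch: approximate the "power" of a state as its optimal value
# averaged over randomly drawn reward functions, in a tiny random MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random deterministic transition table: next_state[s, a] is the successor.
next_state = rng.integers(0, n_states, size=(n_states, n_actions))

def optimal_values(reward, iters=500):
    """Value iteration for per-state rewards under deterministic transitions."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = reward + gamma * np.max(v[next_state], axis=1)
    return v

# Average V*(s) over many sampled reward functions; states from which more
# futures remain reachable tend to score higher -- the power-seeking intuition.
samples = [optimal_values(rng.uniform(size=n_states)) for _ in range(200)]
avg_power = np.mean(samples, axis=0)
print(avg_power)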

Avoiding side effects by considering future tasks

V Krakovna, L Orseau, R Ngo… - Advances in Neural …, 2020 - proceedings.neurips.cc
Designing reward functions is difficult: the designer has to specify what to do (what it means
to complete the task) as well as what not to do (side effects that should be avoided while …
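One way to read the future-tasks idea (a loose sketch; the combination rule, the helper value_fn, and all names here are assumptions, not the paper's algorithm): reward the agent both for the specified task and for its remaining ability to complete randomly sampled future tasks, so that destroying options is implicitly penalized.

def shaped_reward(task_reward, state, future_goals, value_fn, beta=0.1):
    """Task reward plus average estimated value over sampled future goals.

    value_fn(state, goal) is an assumed goal-conditioned value estimate of
    how well the agent could still achieve `goal` from `state`.
    """
    future_ability = sum(value_fn(state, g) for g in future_goals) / len(future_goals)
    return task_reward + beta * future_ability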

Human control: definitions and algorithms

R Carey, T Everitt - Uncertainty in Artificial Intelligence, 2023 - proceedings.mlr.press
How can humans stay in control of advanced artificial intelligence systems? One proposal is
corrigibility, which requires the agent to follow the instructions of a human overseer, without …

Avoiding side effects in complex environments

A Turner, N Ratzlaff, P Tadepalli - Advances in Neural …, 2020 - proceedings.neurips.cc
Reward function specification can be difficult. Rewarding the agent for making a widget may
be easy, but penalizing the multitude of possible negative side effects is hard. In toy …
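A minimal sketch of an attainable-utility-style penalty in this spirit (the paper's exact scaling and auxiliary-value construction differ; all names here are hypothetical): compare the Q-values of auxiliary reward functions under the chosen action against doing nothing, and charge the agent for the shift.

def aup_penalty(state, action, noop, aux_q_fns, scale=1.0):
    """Average absolute change, relative to inaction, in auxiliary Q-values."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in aux_q_fns]
    return sum(diffs) / (len(aux_q_fns) * scale)

def shaped_reward(task_reward, state, action, noop, aux_q_fns, lam=0.1):
    # Penalizing shifts in attainable auxiliary utility discourages actions
    # with broad side effects while leaving the task reward intact.
    return task_reward - lam * aup_penalty(state, action, noop, aux_q_fns)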

Penalizing side effects using stepwise relative reachability

V Krakovna, L Orseau, R Kumar, M Martic… - arXiv preprint arXiv …, 2018 - arxiv.org
How can we design safe reinforcement learning agents that avoid unnecessary disruptions
to their environment? We show that current approaches to penalizing side effects can …
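A hedged sketch of a relative-reachability penalty against a stepwise inaction baseline (assuming per-state reachability vectors are already computed; names are illustrative): only reductions in reachability are charged, so regaining reachability earns no bonus.

import numpy as np

def relative_reachability_penalty(reach_actual, reach_baseline):
    """Average shortfall in state reachability versus the inaction baseline.

    reach_actual[s]   : reachability of state s from the agent's current state
    reach_baseline[s] : reachability of s had the agent done nothing last step
    """
    shortfall = np.maximum(0.0, reach_baseline - reach_actual)
    return shortfall.mean()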

Reinforcement learning under moral uncertainty

A Ecoffet, J Lehman - International conference on machine …, 2021 - proceedings.mlr.press
An ambitious goal for machine learning is to create agents that behave ethically: The
capacity to abide by human moral norms would greatly expand the context in which …