J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor …
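As an illustration of the shape such a formalization can take (notation assumed here rather than quoted from the paper), write $J_{\mathcal{R}}(\pi)$ for the expected return of policy $\pi$ under reward function $\mathcal{R}$. A proxy $\tilde{\mathcal{R}}$ invites hacking relative to the true reward $\mathcal{R}$ when improving proxy return can coincide with worsening true return:

$\exists\, \pi, \pi'\colon \; J_{\tilde{\mathcal{R}}}(\pi') > J_{\tilde{\mathcal{R}}}(\pi) \;\text{ and }\; J_{\mathcal{R}}(\pi') < J_{\mathcal{R}}(\pi).$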
Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives …
A Critch, S Russell - arXiv preprint arXiv:2306.06924, 2023 - arxiv.org
While several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an exhaustive …
Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers …
Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while …
R Carey, T Everitt - Uncertainty in Artificial Intelligence, 2023 - proceedings.mlr.press
How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without …
Reward function specification can be difficult. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy …
How can we design safe reinforcement learning agents that avoid unnecessary disruptions to their environment? We show that current approaches to penalizing side effects can …
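Several of the abstracts above (reward specification, widget-making with unenumerated side effects, penalizing disruptions) point at the same basic design pattern: keep the task reward the designer can write down, and subtract a penalty measuring deviation from some baseline such as the initial state or an inaction rollout. A minimal Python sketch of that pattern, with every name (task_reward, deviation, baseline_state, beta) assumed for illustration rather than taken from any one cited method:

    def shaped_reward(state, baseline_state, task_reward, deviation, beta=0.1):
        """Illustrative side-effect-penalized reward (assumed interface).

        task_reward(state): the designer's "what to do" signal, e.g. reward
            for producing a widget.
        deviation(state, baseline_state): scalar measure of how far the agent
            has pushed the environment away from a baseline (the initial state
            or the state reached by doing nothing), standing in for the many
            "what not to do" side effects that are hard to enumerate by hand.
        beta: trade-off between task progress and avoiding disruption.
        """
        return task_reward(state) - beta * deviation(state, baseline_state)

The cited approaches differ mainly in how the penalty term is defined (e.g. reachability of other states or the value attainable for other goals), not in this overall structure.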
A Ecoffet, J Lehman - International conference on machine …, 2021 - proceedings.mlr.press
An ambitious goal for machine learning is to create agents that behave ethically: The capacity to abide by human moral norms would greatly expand the context in which …