Scalable agent alignment via reward modeling: a research direction

J Leike, D Krueger, T Everitt, M Martic, V Maini… - arXiv preprint arXiv …, 2018 - arxiv.org
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …

AGI safety literature review

T Everitt, G Lea, M Hutter - arXiv preprint arXiv:1805.01109, 2018 - arxiv.org
The development of Artificial General Intelligence (AGI) promises to be a major event. Along
with its many potential benefits, it also raises serious safety concerns (Bostrom, 2014). The …

Machine theory of mind

N Rabinowitz, F Perbet, F Song… - International …, 2018 - proceedings.mlr.press
Abstract Theory of mind (ToM) broadly refers to humans' ability to represent the mental
states of others, including their desires, beliefs, and intentions. We design a Theory of Mind …

AI safety gridworlds

J Leike, M Martic, V Krakovna, PA Ortega… - arXiv preprint arXiv …, 2017 - arxiv.org
We present a suite of reinforcement learning environments illustrating various safety
properties of intelligent agents. These problems include safe interruptibility, avoiding side …

Trustworthy AI

R Chatila, V Dignum, M Fisher, F Giannotti… - Reflections on artificial …, 2021 - Springer
Modern AI systems have become of widespread use in almost all sectors with a strong
impact on our society. However, the very methods on which they rely, based on Machine …

Safe imitation learning via fast Bayesian reward inference from preferences

D Brown, R Coleman, R Srinivasan… - … on Machine Learning, 2020 - proceedings.mlr.press
Bayesian reward learning from demonstrations enables rigorous safety and uncertainty
analysis when performing imitation learning. However, Bayesian reward learning methods …

Hard choices in artificial intelligence

R Dobbe, TK Gilbert, Y Mintz - Artificial Intelligence, 2021 - Elsevier
As AI systems are integrated into high stakes social domains, researchers now examine how
to design and operate them in a safe and ethical manner. However, the criteria for identifying …

Advanced artificial agents intervene in the provision of reward

M Cohen, M Hutter, M Osborne - AI magazine, 2022 - ojs.aaai.org
To analyze the expected behavior of advanced artificial agents, we consider a formal
idealized agent that makes observations that inform it about its goal, and we find that it can …

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

T Everitt, M Hutter, R Kumar, V Krakovna - Synthese, 2021 - Springer
Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding?
Or will sufficiently capable RL agents always find ways to bypass their intended objectives …

Occam's razor is insufficient to infer the preferences of irrational agents

S Armstrong, S Mindermann - Advances in neural …, 2018 - proceedings.neurips.cc
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from
observed behavior. Since human planning systematically deviates from rationality, several …