Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Defining and characterizing reward gaming

J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of \textbf{reward hacking}, a phenomenon where
optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor …
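As a rough sketch of the flavor of definition involved (my paraphrase, not the paper's exact statement): a proxy $\tilde{\mathcal{R}}$ is hackable relative to the true reward $\mathcal{R}$ if some pair of policies $\pi, \pi'$ satisfies

$$ J_{\tilde{\mathcal{R}}}(\pi') > J_{\tilde{\mathcal{R}}}(\pi) \quad \text{and} \quad J_{\mathcal{R}}(\pi') < J_{\mathcal{R}}(\pi), $$

where $J_{R}(\pi)$ denotes the expected return of policy $\pi$ under reward function $R$: improving the proxy strictly worsens true performance.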

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective

T Everitt, M Hutter, R Kumar, V Krakovna - Synthese, 2021 - Springer
Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding?
Or will sufficiently capable RL agents always find ways to bypass their intended objectives …

TASRA: a taxonomy and analysis of societal-scale risks from AI

A Critch, S Russell - arXiv preprint arXiv:2306.06924, 2023 - arxiv.org
While several recent works have identified societal-scale and extinction-level risks to
humanity arising from artificial intelligence, few have attempted an exhaustive …

Optimal policies tend to seek power

AM Turner, L Smith, R Shah, A Critch… - arXiv preprint arXiv …, 2019 - arxiv.org
Some researchers speculate that intelligent reinforcement learning (RL) agents would be
incentivized to seek resources and power in pursuit of their objectives. Other researchers …
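The power-seeking claim can be made concrete in a small tabular MDP: score each state by its optimal value averaged over randomly sampled reward functions, and states that keep more options open tend to score higher. A minimal sketch of that computation (the toy MDP and all variable names are hypothetical, not the paper's setup):

import numpy as np

# Hedged sketch: approximate the "power" of a state as its optimal value
# averaged over randomly drawn reward functions, in a tiny random MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random deterministic transition table: next_state[s, a] is the successor.
next_state = rng.integers(0, n_states, size=(n_states, n_actions))

def optimal_values(reward, iters=500):
    """Value iteration for per-state rewards under deterministic transitions."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = reward + gamma * np.max(v[next_state], axis=1)
    return v

# Average V*(s) over many sampled reward functions; states from which more
# futures remain reachable tend to score higher -- the power-seeking intuition.
samples = [optimal_values(rng.uniform(size=n_states)) for _ in range(200)]
avg_power = np.mean(samples, axis=0)
print(avg_power)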

Avoiding side effects by considering future tasks

V Krakovna, L Orseau, R Ngo… - Advances in Neural …, 2020 - proceedings.neurips.cc
Designing reward functions is difficult: the designer has to specify what to do (what it means
to complete the task) as well as what not to do (side effects that should be avoided while …
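One way to read the future-tasks idea (a loose sketch; the combination rule, the helper value_fn, and all names here are assumptions, not the paper's algorithm): reward the agent both for the specified task and for its remaining ability to complete randomly sampled future tasks, so that destroying options is implicitly penalized.

def shaped_reward(task_reward, state, future_goals, value_fn, beta=0.1):
    """Task reward plus average estimated value over sampled future goals.

    value_fn(state, goal) is an assumed goal-conditioned value estimate of
    how well the agent could still achieve `goal` from `state`.
    """
    future_ability = sum(value_fn(state, g) for g in future_goals) / len(future_goals)
    return task_reward + beta * future_ability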

Human control: definitions and algorithms

R Carey, T Everitt - Uncertainty in Artificial Intelligence, 2023 - proceedings.mlr.press
How can humans stay in control of advanced artificial intelligence systems? One proposal is
corrigibility, which requires the agent to follow the instructions of a human overseer, without …

Avoiding side effects in complex environments

A Turner, N Ratzlaff, P Tadepalli - Advances in Neural …, 2020 - proceedings.neurips.cc
Reward function specification can be difficult. Rewarding the agent for making a widget may
be easy, but penalizing the multitude of possible negative side effects is hard. In toy …
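A minimal sketch of an attainable-utility-style penalty in this spirit (the paper's exact scaling and auxiliary-value construction differ; all names here are hypothetical): compare the Q-values of auxiliary reward functions under the chosen action against doing nothing, and charge the agent for the shift.

def aup_penalty(state, action, noop, aux_q_fns, scale=1.0):
    """Average absolute change, relative to inaction, in auxiliary Q-values."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in aux_q_fns]
    return sum(diffs) / (len(aux_q_fns) * scale)

def shaped_reward(task_reward, state, action, noop, aux_q_fns, lam=0.1):
    # Penalizing shifts in attainable auxiliary utility discourages actions
    # with broad side effects while leaving the task reward intact.
    return task_reward - lam * aup_penalty(state, action, noop, aux_q_fns)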

Penalizing side effects using stepwise relative reachability

V Krakovna, L Orseau, R Kumar, M Martic… - arXiv preprint arXiv …, 2018 - arxiv.org
How can we design safe reinforcement learning agents that avoid unnecessary disruptions
to their environment? We show that current approaches to penalizing side effects can …
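A hedged sketch of a relative-reachability penalty against a stepwise inaction baseline (assuming per-state reachability vectors are already computed; names are illustrative): only reductions in reachability are charged, so regaining reachability earns no bonus.

import numpy as np

def relative_reachability_penalty(reach_actual, reach_baseline):
    """Average shortfall in state reachability versus the inaction baseline.

    reach_actual[s]   : reachability of state s from the agent's current state
    reach_baseline[s] : reachability of s had the agent done nothing last step
    """
    shortfall = np.maximum(0.0, reach_baseline - reach_actual)
    return shortfall.mean()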

Reinforcement learning under moral uncertainty

A Ecoffet, J Lehman - International conference on machine …, 2021 - proceedings.mlr.press
An ambitious goal for machine learning is to create agents that behave ethically: The
capacity to abide by human moral norms would greatly expand the context in which …