Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Language instructed reinforcement learning for human-AI coordination

H Hu, D Sadigh - International Conference on Machine …, 2023 - proceedings.mlr.press
One of the fundamental quests of AI is to produce agents that coordinate well with humans.
This problem is challenging, especially in domains that lack high quality human behavioral …

Learning to influence human behavior with offline reinforcement learning

J Hong, S Levine, A Dragan - Advances in Neural …, 2024 - proceedings.neurips.cc
When interacting with people, AI agents do not just influence the state of the world; they also
influence the actions people take in response to the agent, and even their underlying …

Causal confusion and reward misidentification in preference-based reward learning

J Tien, JZY He, Z Erickson, AD Dragan… - arXiv preprint arXiv …, 2022 - arxiv.org
Learning policies via preference-based reward learning is an increasingly popular method
for customizing agent behavior, but has been shown anecdotally to be prone to spurious …

(Ir)rationality in AI: State of the Art, Research Challenges and Open Questions

O Macmillan-Scott, M Musolesi - arXiv preprint arXiv:2311.17165, 2023 - arxiv.org
The concept of rationality is central to the field of artificial intelligence. Whether we are
seeking to simulate human reasoning, or the goal is to achieve bounded optimality, we …

Learning zero-shot cooperation with humans, assuming humans are biased

C Yu, J Gao, W Liu, B Xu, H Tang, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an
agent that can cooperate with humans in a zero-shot fashion without using any human data …

Personalizing reinforcement learning from human feedback with variational preference learning

S Poddar, Y Wan, H Ivison, A Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning
foundation models to human values and preferences. However, current RLHF techniques …

Beyond preferences in AI alignment

T Zhi-Xuan, M Carroll, M Franklin, H Ashton - Philosophical Studies, 2024 - Springer
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in terms of …

Learning to assist humans without inferring rewards

V Myers, E Ellis, S Levine, B Eysenbach… - arXiv preprint arXiv …, 2024 - arxiv.org
Assistive agents should make humans' lives easier. Classically, such assistance is studied
through the lens of inverse reinforcement learning, where an assistive agent (e.g., a chatbot …

Learning to make adherence-aware advice

G Chen, X Li, C Sun, H Wang - arXiv preprint arXiv:2310.00817, 2023 - arxiv.org
As artificial intelligence (AI) systems play an increasingly prominent role in human decision-
making, challenges surface in the realm of human-AI interactions. One challenge arises from …