AI deception: A survey of examples, risks, and potential solutions

PS Park, S Goldstein, A O'Gara, M Chen, D Hendrycks - Patterns, 2024 - cell.com
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some …

AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Managing AI risks in an era of rapid progress

Y Bengio, G Hinton, A Yao, D Song… - arXiv preprint arXiv …, 2023 - blog.biocomm.ai
In this short consensus paper, we outline risks from upcoming, advanced AI systems. We
examine large-scale social harms and malicious uses, as well as an irreversible loss of …

Rewarded soups: Towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

A Rame, G Couairon, C Dancette… - Advances in …, 2024 - proceedings.neurips.cc
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned
on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further …

The alignment problem from a deep learning perspective

R Ngo, L Chan, S Mindermann - arXiv preprint arXiv:2209.00626, 2022 - arxiv.org
In coming decades, artificial general intelligence (AGI) may surpass human capabilities at
many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to …

Steve-1: A generative model for text-to-behavior in Minecraft

S Lifshitz, K Paster, H Chan, J Ba… - Advances in Neural …, 2024 - proceedings.neurips.cc
Constructing AI models that respond to text instructions is challenging, especially for
sequential decision-making tasks. This work introduces an instruction-tuned Video …

Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods

Y Cao, H Zhao, Y Cheng, T Shu, Y Chen… - … on Neural Networks …, 2024 - ieeexplore.ieee.org
With extensive pretrained knowledge and high-level general capabilities, large language
models (LLMs) emerge as a promising avenue to augment reinforcement learning (RL) in …

Harms from increasingly agentic algorithmic systems

A Chan, R Salganik, A Markelius, C Pang… - Proceedings of the …, 2023 - dl.acm.org
Research in Fairness, Accountability, Transparency, and Ethics (FATE) has established
many sources and forms of algorithmic harm, in domains as diverse as health care, finance …