AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The Pile: An 800GB dataset of diverse text for language modeling

L Gao, S Biderman, S Black, L Golding… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent work has demonstrated that increased training dataset diversity improves general
cross-domain knowledge and downstream generalization capability for large-scale …

Unsolved problems in ML safety

D Hendrycks, N Carlini, J Schulman… - arXiv preprint arXiv …, 2021 - arxiv.org
Machine learning (ML) systems are rapidly increasing in size, are acquiring new
capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …

Risk assessment at AGI companies: A review of popular risk assessment techniques from other safety-critical industries

L Koessler, J Schuett - arXiv preprint arXiv:2307.08823, 2023 - arxiv.org
Companies like OpenAI, Google DeepMind, and Anthropic have the stated goal of building
artificial general intelligence (AGI): AI systems that perform as well as or better than humans …

Is power-seeking AI an existential risk?

J Carlsmith - arXiv preprint arXiv:2206.13353, 2022 - arxiv.org
This report examines what I see as the core argument for concern about existential risk from
misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture …

Large language model alignment: A survey

T Shen, R Jin, Y Huang, C Liu, W Dong, Z Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent years have witnessed remarkable progress in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …

Large language models (LLMs): survey, technical frameworks, and future challenges

P Kumar - Artificial Intelligence Review, 2024 - Springer
Artificial intelligence (AI) has significantly impacted various fields. Large language models
(LLMs) like GPT-4, Bard, PaLM, Megatron-Turing NLG, Jurassic-1 Jumbo, etc., have …

Truthful AI: Developing and governing AI that does not lie

O Evans, O Cotton-Barratt, L Finnveden… - arXiv preprint arXiv …, 2021 - arxiv.org
In many contexts, lying (the use of verbal falsehoods to deceive) is harmful. While lying has
traditionally been a human affair, AI systems that make sophisticated verbal statements are …