Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state …
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories …
L Gao, S Biderman, S Black, L Golding… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale …
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …
Companies like OpenAI, Google DeepMind, and Anthropic have the stated goal of building artificial general intelligence (AGI): AI systems that perform as well as or better than humans …
This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture …
Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various …
P Kumar - Artificial Intelligence Review, 2024 - Springer
Artificial intelligence (AI) has significantly impacted various fields. Large language models (LLMs) such as GPT-4, BARD, PaLM, Megatron-Turing NLG, and Jurassic-1 Jumbo have …
O Evans, O Cotton-Barratt, L Finnveden… - arXiv preprint arXiv …, 2021 - arxiv.org
In many contexts, lying, the use of verbal falsehoods to deceive, is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are …