Autonomous driving has achieved significant milestones in research and development over the last two decades. There is increasing interest in the field as the deployment of …
The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the …
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these …
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state …
Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models …
Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many …
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories …
J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of\textbf {reward hacking}, a phenomenon where optimizing an imperfect proxy reward function, $\mathcal {\tilde {R}} $, leads to poor …
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …