AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

GPT-4 technical report

J Achiam, S Adler, S Agarwal, L Ahmad… - arXiv preprint arXiv …, 2023 - arxiv.org
We report the development of GPT-4, a large-scale, multimodal model which can accept
image and text inputs and produce text outputs. While less capable than humans in many …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

Scaling laws for reward model overoptimization

L Gao, J Schulman, J Hilton - International Conference on …, 2023 - proceedings.mlr.press
In reinforcement learning from human feedback, it is common to optimize against a reward
model trained to predict human preferences. Because the reward model is an imperfect …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Transformers in healthcare: A survey

S Nerella, S Bandyopadhyay, J Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
With Artificial Intelligence (AI) increasingly permeating various aspects of society, including
healthcare, the adoption of the Transformers neural network architecture is rapidly changing …

Fundamental limitations of alignment in large language models

Y Wolf, N Wies, O Avnery, Y Levine… - arXiv preprint arXiv …, 2023 - arxiv.org
An important aspect in developing language models that interact with humans is aligning
their behavior to be useful and unharmful for their human users. This is usually achieved by …

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …

Generative AI: Here to stay, but for good?

HS Sætra - Technology in Society, 2023 - Elsevier
Generative AI has taken the world by storm, kicked off for real by ChatGPT and quickly
followed by further development and the release of GPT-4 and similar models from OpenAI's …