AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

GPT-4 technical report

J Achiam, S Adler, S Agarwal, L Ahmad… - arXiv preprint arXiv …, 2023 - arxiv.org
We report the development of GPT-4, a large-scale, multimodal model which can accept
image and text inputs and produce text outputs. While less capable than humans in many …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc
Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

Scaling laws for reward model overoptimization

L Gao, J Schulman, J Hilton - International Conference on …, 2023 - proceedings.mlr.press
In reinforcement learning from human feedback, it is common to optimize against a reward
model trained to predict human preferences. Because the reward model is an imperfect …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Transformers in healthcare: A survey

S Nerella, S Bandyopadhyay, J Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
With Artificial Intelligence (AI) increasingly permeating various aspects of society, including
healthcare, the adoption of the Transformers neural network architecture is rapidly changing …

Fundamental limitations of alignment in large language models

Y Wolf, N Wies, O Avnery, Y Levine… - arXiv preprint arXiv …, 2023 - arxiv.org
An important aspect in developing language models that interact with humans is aligning
their behavior to be useful and unharmful for their human users. This is usually achieved by …

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …

Generative AI: Here to stay, but for good?

HS Sætra - Technology in Society, 2023 - Elsevier
Generative AI has taken the world by storm, kicked off for real by ChatGPT and quickly
followed by further development and the release of GPT-4 and similar models from OpenAI's …