Defining and characterizing reward gaming

J Skalse, N Howe… - Advances in Neural …, 2022 - proceedings.neurips.cc
We provide the first formal definition of reward hacking, a phenomenon where
optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor …

Foundation models and fair use

P Henderson, X Li, D Jurafsky, T Hashimoto… - Journal of Machine …, 2023 - jmlr.org
Existing foundation models are trained on copyrighted material. Deploying these models
can pose both legal and ethical risks when data creators fail to receive appropriate …

A general language assistant as a laboratory for alignment

A Askell, Y Bai, A Chen, D Drain, D Ganguli… - arXiv preprint arXiv …, 2021 - arxiv.org
Given the broad capabilities of large language models, it should be possible to work towards
a general-purpose, text-based assistant that is aligned with human values, meaning that it is …

Towards measuring the representation of subjective global opinions in language models

E Durmus, K Nguyen, TI Liao, N Schiefer… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) may not equitably represent diverse global perspectives on
societal issues. In this paper, we develop a quantitative framework to evaluate whose …

Unsolved problems in ml safety

D Hendrycks, N Carlini, J Schulman… - arXiv preprint arXiv …, 2021 - arxiv.org
Machine learning (ML) systems are rapidly increasing in size, are acquiring new
capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …

Identifying and mitigating the security risks of generative ai

C Barrett, B Boyd, E Bursztein, N Carlini… - … and Trends® in …, 2023 - nowpublishers.com
Every major technical invention resurfaces the dual-use dilemma—the new technology has
the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such …

Safe rlhf: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Smoothllm: Defending large language models against jailbreaking attacks

A Robey, E Wong, H Hassani, GJ Pappas - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models (LLMs) with human values, widely-used
LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks …

The alignment problem from a deep learning perspective

R Ngo, L Chan, S Mindermann - arXiv preprint arXiv:2209.00626, 2022 - arxiv.org
In coming decades, artificial general intelligence (AGI) may surpass human capabilities at
many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to …

A taxonomy of prompt modifiers for text-to-image generation

J Oppenlaender - Behaviour & Information Technology, 2023 - Taylor & Francis
Text-guided synthesis of images has become enormously popular and online communities
dedicated to text-to-image generation and art generated with Artificial Intelligence (AI) have …