Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A review of safe reinforcement learning: Methods, theory and applications

S Gu, L Yang, Y Du, G Chen, F Walter, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Reinforcement learning (RL) has achieved tremendous success in many complex decision-making tasks. When it comes to deploying RL in the real world, safety concerns are usually …

Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation

N Díaz-Rodríguez, J Del Ser, M Coeckelbergh… - Information …, 2023 - Elsevier
Trustworthy Artificial Intelligence (AI) is based on seven technical requirements
sustained over three main pillars that should be met throughout the system's entire life cycle …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to fine-tune state …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

OpenOOD: Benchmarking generalized out-of-distribution detection

J Yang, P Wang, D Zou, Z Zhou… - Advances in …, 2022 - proceedings.neurips.cc
Out-of-distribution (OOD) detection is vital to safety-critical machine learning
applications and has thus been extensively studied, with a plethora of methods developed in …

Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the Machiavelli benchmark

A Pan, JS Chan, A Zou, N Li, S Basart… - International …, 2023 - proceedings.mlr.press
Artificial agents have traditionally been trained to maximize reward, which may incentivize
power-seeking and deception, analogous to how next-token prediction in language models …

Generalized out-of-distribution detection: A survey

J Yang, K Zhou, Y Li, Z Liu - International Journal of Computer Vision, 2024 - Springer
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of
machine learning systems. For instance, in autonomous driving, we would like the driving …

Prompting GPT-3 to be reliable

C Si, Z Gan, Z Yang, S Wang, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) show impressive abilities via few-shot prompting.
Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world …

Predictability and surprise in large generative models

D Ganguli, D Hernandez, L Lovitt, A Askell… - Proceedings of the …, 2022 - dl.acm.org
Large-scale pre-training has recently emerged as a technique for creating capable, general-
purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many …