Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A review of safe reinforcement learning: Methods, theory and applications

S Gu, L Yang, Y Du, G Chen, F Walter, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Reinforcement learning (RL) has achieved tremendous success in many complex decision-making tasks. When it comes to deploying RL in the real world, safety concerns are usually …

Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation

N Díaz-Rodríguez, J Del Ser, M Coeckelbergh… - Information …, 2023 - Elsevier
Trustworthy Artificial Intelligence (AI) is based on seven technical requirements
sustained over three main pillars that should be met throughout the system's entire life cycle …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to fine-tune state …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

OpenOOD: Benchmarking generalized out-of-distribution detection

J Yang, P Wang, D Zou, Z Zhou… - Advances in …, 2022 - proceedings.neurips.cc
Out-of-distribution (OOD) detection is vital to safety-critical machine learning
applications and has thus been extensively studied, with a plethora of methods developed in …

Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the Machiavelli benchmark

A Pan, JS Chan, A Zou, N Li, S Basart… - International …, 2023 - proceedings.mlr.press
Artificial agents have traditionally been trained to maximize reward, which may incentivize
power-seeking and deception, analogous to how next-token prediction in language models …

Generalized out-of-distribution detection: A survey

J Yang, K Zhou, Y Li, Z Liu - International Journal of Computer Vision, 2024 - Springer
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of
machine learning systems. For instance, in autonomous driving, we would like the driving …

Prompting GPT-3 to be reliable

C Si, Z Gan, Z Yang, S Wang, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) show impressive abilities via few-shot prompting.
Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world …

Predictability and surprise in large generative models

D Ganguli, D Hernandez, L Lovitt, A Askell… - Proceedings of the …, 2022 - dl.acm.org
Large-scale pre-training has recently emerged as a technique for creating capable, general-
purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many …