The inadequacy of reinforcement learning from human feedback - radicalizing large language models via semantic vulnerabilities

TR McIntosh, T Susnjak, T Liu, P Watters… - … on Cognitive and …, 2024 - ieeexplore.ieee.org
This study is an empirical investigation into the semantic vulnerabilities of four popular pre-
trained commercial Large Language Models (LLMs) to ideological manipulation. Using …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

On the exploitability of reinforcement learning with human feedback for large language models

J Wang, J Wu, M Chen, Y Vorobeychik… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align
Large Language Models (LLMs) with human preferences, playing an important role in LLMs …

Opening the black box of large language models: Two views on holistic interpretability

H Zhao, F Yang, H Lakkaraju, M Du - arXiv preprint arXiv:2402.10688, 2024 - arxiv.org
As large language models (LLMs) grow more powerful, concerns around potential harms
like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment …

Fine-grained human feedback gives better rewards for language model training

Z Wu, Y Hu, W Shi, N Dziri, A Suhr… - Advances in …, 2024 - proceedings.neurips.cc
Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement learning from human …

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

K Zhou, JD Hwang, X Ren, M Sap - arXiv preprint arXiv:2401.06730, 2024 - arxiv.org
As natural language becomes the default interface for human-AI interaction, there is a critical
need for LMs to appropriately communicate uncertainties in downstream applications. In this …

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …

Can LLMs Follow Simple Rules?

N Mu, S Chen, Z Wang, S Chen, D Karamardian… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities,
it is important to be able to specify and constrain the behavior of these systems in a reliable …

Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Reinforcement learning in the era of LLMs: What is essential? What is needed? An RL perspective on RLHF, prompting, and beyond

H Sun - arXiv preprint arXiv:2310.06147, 2023 - arxiv.org
Recent advancements in Large Language Models (LLMs) have garnered wide attention and
led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to …