The inadequacy of reinforcement learning from human feedback - radicalizing large language models via semantic vulnerabilities

TR McIntosh, T Susnjak, T Liu, P Watters… - … on Cognitive and …, 2024 - ieeexplore.ieee.org
This study is an empirical investigation into the semantic vulnerabilities of four popular pre-
trained commercial Large Language Models (LLMs) to ideological manipulation. Using …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

On the exploitability of reinforcement learning with human feedback for large language models

J Wang, J Wu, M Chen, Y Vorobeychik… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align
Large Language Models (LLMs) with human preferences, playing an important role in LLMs …

Opening the black box of large language models: Two views on holistic interpretability

H Zhao, F Yang, H Lakkaraju, M Du - arXiv preprint arXiv:2402.10688, 2024 - arxiv.org
As large language models (LLMs) grow more powerful, concerns around potential harms
like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment …

Fine-grained human feedback gives better rewards for language model training

Z Wu, Y Hu, W Shi, N Dziri, A Suhr… - Advances in …, 2024 - proceedings.neurips.cc
Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement learning from human …

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

K Zhou, JD Hwang, X Ren, M Sap - arXiv preprint arXiv:2401.06730, 2024 - arxiv.org
As natural language becomes the default interface for human-AI interaction, there is a critical
need for LMs to appropriately communicate uncertainties in downstream applications. In this …

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …

Can LLMs Follow Simple Rules?

N Mu, S Chen, Z Wang, S Chen, D Karamardian… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities,
it is important to be able to specify and constrain the behavior of these systems in a reliable …

Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Reinforcement learning in the era of LLMs: What is essential? What is needed? An RL perspective on RLHF, prompting, and beyond

H Sun - arXiv preprint arXiv:2310.06147, 2023 - arxiv.org
Recent advancements in Large Language Models (LLMs) have garnered wide attention and
led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to …