Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

YL Tuan, X Chen, EM Smith, L Martin, S Batra… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become widely accessible, the trade-off
between safety and helpfulness can significantly impact user experience. A model that …

Bi-factorial preference optimization: Balancing safety-helpfulness in language models

W Zhang, PHS Torr, M Elhoseiny, A Bibi - arXiv preprint arXiv:2408.15313, 2024 - arxiv.org
Fine-tuning large language models (LLMs) on human preferences, typically through
reinforcement learning from human feedback (RLHF), has proven successful in enhancing …

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards

K Chehbouni, M Roshan, E Ma, FA Wei, A Taïk… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in large language models (LLMs) has led to their widespread adoption in
various domains. However, these advancements have also introduced additional safety …

Rule based rewards for language model safety

T Mu, A Helyar, J Heidecke, J Achiam… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning-based fine-tuning of large language models (LLMs) on human
preferences has been shown to enhance both their capabilities and safety behavior …

Exploring Safety-Utility Trade-Offs in Personalized Language Models

AR Vijjini, SBR Chowdhury, S Chaturvedi - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into daily applications, it
is essential to ensure they operate fairly across diverse user demographics. In this work, we …

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

SY Peng, PY Chen, M Hull, DH Chau - arXiv preprint arXiv:2405.17374, 2024 - arxiv.org
Safety alignment is key to guiding the behavior of large language models (LLMs) so that it
is in line with human preferences and restricts harmful behaviors at inference time, but …

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

J Ji, D Hong, B Zhang, B Chen, J Dai, B Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on
safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and …

Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions

L Cotta, CJ Maddison - arXiv preprint arXiv:2406.07685, 2024 - arxiv.org
Frontier Large Language Models (LLMs) are increasingly being deployed for high-stakes
decision-making. On the other hand, these models are still consistently making predictions …

Making harmful behaviors unlearnable for large language models

X Zhou, Y Lu, R Ma, T Gui, Q Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have shown great potential as general-purpose AI
assistants in various domains. To meet the requirements of different applications, LLMs are …

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …