Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

YL Tuan, X Chen, EM Smith, L Martin, S Batra… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become widely accessible, the trade-off
between safety and helpfulness can significantly impact user experience. A model that …

Bi-factorial preference optimization: Balancing safety-helpfulness in language models

W Zhang, PHS Torr, M Elhoseiny, A Bibi - arXiv preprint arXiv:2408.15313, 2024 - arxiv.org
Fine-tuning large language models (LLMs) on human preferences, typically through
reinforcement learning from human feedback (RLHF), has proven successful in enhancing …

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards

K Chehbouni, M Roshan, E Ma, FA Wei, A Taïk… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in large language models (LLMs) has led to their widespread adoption in
various domains. However, these advancements have also introduced additional safety …

Rule based rewards for language model safety

T Mu, A Helyar, J Heidecke, J Achiam… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning-based fine-tuning of large language models (LLMs) on human
preferences has been shown to enhance both their capabilities and safety behavior …

Exploring Safety-Utility Trade-Offs in Personalized Language Models

AR Vijjini, SBR Chowdhury, S Chaturvedi - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into daily applications, it
is essential to ensure they operate fairly across diverse user demographics. In this work, we …

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

SY Peng, PY Chen, M Hull, DH Chau - arXiv preprint arXiv:2405.17374, 2024 - arxiv.org
Safety alignment is key to guiding the behavior of large language models (LLMs) so that it
is in line with human preferences and restricts harmful behaviors at inference time, but …

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

J Ji, D Hong, B Zhang, B Chen, J Dai, B Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on
safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and …

Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions

L Cotta, CJ Maddison - arXiv preprint arXiv:2406.07685, 2024 - arxiv.org
Frontier Large Language Models (LLMs) are increasingly being deployed for high-stakes
decision-making. On the other hand, these models are still consistently making predictions …

Making harmful behaviors unlearnable for large language models

X Zhou, Y Lu, R Ma, T Gui, Q Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have shown great potential as general-purpose AI
assistants in various domains. To meet the requirements of different applications, LLMs are …

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …