Safe RLHF: Safe Reinforcement Learning from Human Feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …
