AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher

Y Yuan, W Jiao, W Wang, J Huang, P He, S Shi… - arXiv preprint arXiv …, 2023 - arxiv.org
Safety lies at the core of the development of Large Language Models (LLMs). There is
ample work on aligning LLMs with human ethics and preferences, including data filtering in …

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …

Safe DreamerV3: Safe reinforcement learning with world models

W Huang, J Ji, B Zhang, C Xia, Y Yang - arXiv preprint arXiv:2307.07176, 2023 - arxiv.org
The widespread application of Reinforcement Learning (RL) in real-world situations is yet to
come to fruition, largely as a result of its failure to satisfy the essential safety demands of …

Hummer: Towards limited competitive preference dataset

L Jiang, Y Wu, J Xiong, J Ruan, Y Ding, Q Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference datasets are essential for incorporating human preferences into pre-trained
language models, playing a key role in the success of Reinforcement Learning from Human …

A safety realignment framework via subspace-oriented model fusion for large language models

X Yi, S Zheng, L Wang, X Wang, L He - arXiv preprint arXiv:2405.09055, 2024 - arxiv.org
The current safeguard mechanisms for large language models (LLMs) are indeed
susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine …

Quantifying the Gain in Weak-to-Strong Generalization

M Charikar, C Pabbaraju, K Shiragur - arXiv preprint arXiv:2405.15116, 2024 - arxiv.org
Recent advances in large language models have shown capabilities that are extraordinary
and near-superhuman. These models operate with such complexity that reliably evaluating …

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

J Ji, D Hong, B Zhang, B Chen, J Dai, B Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on
safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and …

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

J Cheng, Y Lu, X Gu, P Ke, X Liu, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Although Large Language Models (LLMs) are becoming increasingly powerful, they still
exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding …