Negative preference optimization: From catastrophic collapse to effective unlearning

R Zhang, L Lin, Y Bai, S Mei - arXiv preprint arXiv:2404.05868, 2024 - arxiv.org
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data
during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from …
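
The forget-set objective the title refers to is usually written as L_NPO = (2/β) E_{(x,y)∼D_forget}[log(1 + (π_θ(y|x)/π_ref(y|x))^β)], a smooth, bounded alternative to the gradient-ascent baseline that tends to collapse. Below is a minimal PyTorch sketch of that loss, assuming the summed sequence log-probabilities are precomputed; the function name and signature are illustrative, not the paper's released code:

```python
import torch
import torch.nn.functional as F

def npo_loss(logp_theta: torch.Tensor, logp_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """NPO loss on a batch of forget-set sequences.

    logp_theta: log pi_theta(y|x) summed over tokens, current model (with grad)
    logp_ref:   log pi_ref(y|x) under the frozen reference model
    Identity used: (2/beta) * log(1 + (pi_theta/pi_ref)^beta)
                 = (2/beta) * softplus(beta * (logp_theta - logp_ref)).
    As beta -> 0 this recovers plain gradient ascent on the forget set (up to
    a constant); finite beta bounds the per-example gradient, which is what
    slows the catastrophic collapse the title alludes to.
    """
    log_ratio = logp_theta - logp_ref
    return (2.0 / beta) * F.softplus(beta * log_ratio).mean()
```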

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

K D'Oosterlinck, W Xu, C Develder… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are often aligned using contrastive alignment objectives
and preference pair datasets. The interaction between model, paired data, and objective …
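
The contrastive objective such work starts from is DPO (Rafailov et al., 2023); the snippet does not spell out the anchored variants themselves, so the sketch below shows only the standard DPO loss as context, with a comment on the underspecification the title targets. Names and signature are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO contrastive loss over preference pairs.

    Each argument is a summed sequence log-probability; the ref_* values come
    from a frozen reference model. Note the underspecification: the loss only
    constrains the margin h_w - h_l, not the absolute likelihoods, so the
    chosen response's likelihood may drop as long as the rejected one drops
    faster -- the gap anchored objectives aim to close.
    """
    h_w = logp_chosen - ref_logp_chosen      # implicit reward, chosen response
    h_l = logp_rejected - ref_logp_rejected  # implicit reward, rejected response
    return -F.logsigmoid(beta * (h_w - h_l)).mean()
```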

Rethinking Entity-level Unlearning for Large Language Models

W Ma, X Feng, W Zhong, L Huang, Y Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model unlearning has gained increasing attention due to its potential to
mitigate security and privacy concerns. Current research predominantly focuses on Instance …

Finding Safety Neurons in Large Language Models

J Chen, X Wang, Z Yao, Y Bai, L Hou, J Li - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) excel in various capabilities but also pose safety risks such
as generating harmful content and misinformation, even after safety alignment. In this paper …