Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …

Demystifying verbatim memorization in large language models

J Huang, D Yang, C Potts - arXiv preprint arXiv:2407.17817, 2024 - arxiv.org
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with
serious legal and privacy implications. Much prior work has studied such verbatim …

PreCurious: How innocent pre-trained language models turn into privacy traps

R Liu, T Wang, Y Cao, L Xiong - Proceedings of the 2024 on ACM …, 2024 - dl.acm.org
The pre-training and fine-tuning paradigm has demonstrated its effectiveness and has
become the standard approach for tailoring language models to various tasks. Currently …

SoK: Memorization in general-purpose large language models

V Hartmann, A Suri, V Bindschaedler, D Evans… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are advancing at a remarkable pace, with myriad
applications under development. Unlike most earlier machine learning models, they are no …

Layerwise linear mode connectivity

L Adilova, A Fischer, M Jaggi - arXiv preprint arXiv:2307.06966, 2023 - arxiv.org
In the federated setup, one performs an aggregation of separate local models multiple times
during training in order to obtain a stronger global model; most often aggregation is a simple …

Memorization in deep learning: A survey

J Wei, Y Zhang, LY Zhang, M Ding, C Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep Learning (DL) powered by Deep Neural Networks (DNNs) has revolutionized various
domains, yet understanding the intricacies of DNN decision-making and learning processes …

Decomposing and editing predictions by modeling model computation

H Shah, A Ilyas, A Madry - arXiv preprint arXiv:2404.11534, 2024 - arxiv.org
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …

DEAN: Deactivating the coupled neurons to mitigate fairness-privacy conflicts in large language models

C Qian, D Liu, J Zhang, Y Liu, J Shao - arXiv preprint arXiv:2410.16672, 2024 - arxiv.org
Ensuring awareness of fairness and privacy in Large Language Models (LLMs) is critical.
Interestingly, we discover a counter-intuitive trade-off phenomenon that enhancing an LLM's …

Memorization in self-supervised learning improves downstream generalization

W Wang, MA Kaleem, A Dziedzic, M Backes… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) has recently received significant attention due to its ability to
train high-performance encoders purely on unlabeled data, often scraped from the internet …

PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding

KK Nakka, A Frikha, R Mendes, X Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
The latest and most impactful advances in large models stem from their increased size.
Unfortunately, this translates into an improved memorization capacity, raising data privacy …