Analyzing leakage of personally identifiable information in language models

N Lukas, A Salem, R Sim, S Tople… - … IEEE Symposium on …, 2023 - ieeexplore.ieee.org
Language Models (LMs) have been shown to leak information about training data through
sentence-level membership inference and reconstruction attacks. Understanding the risk of …
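
The reconstruction-style probing studied in this line of work can be illustrated with a minimal sketch (assumptions: a HuggingFace causal LM such as GPT-2 and a hypothetical prefix/target pair; this is a generic memorization probe, not the attack evaluated in the paper):

```python
# Generic sketch: test whether a causal LM reproduces a PII string verbatim
# when prompted with the context that preceded it in the (assumed) training data.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def greedy_continuation(prefix: str, max_new_tokens: int = 20) -> str:
    """Return the model's greedy continuation of a prefix."""
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

prefix = "Contact John Doe at"         # hypothetical training-data prefix
target_pii = "john.doe@example.com"    # hypothetical PII string to look for
print(target_pii in greedy_continuation(prefix))  # True would indicate verbatim leakage
```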

Are large pre-trained language models leaking your personal information?

J Huang, H Shao, KCC Chang - arXiv preprint arXiv:2205.12628, 2022 - arxiv.org
In this paper, we analyze whether Pre-Trained Language Models (PLMs) are prone to leaking personal …

Quantifying privacy risks of masked language models using membership inference attacks

F Mireshghallah, K Goyal, A Uniyal… - arXiv preprint arXiv …, 2022 - arxiv.org
The wide adoption and application of Masked Language Models (MLMs) on sensitive data
(from legal to medical) necessitates a thorough quantitative investigation into their privacy …
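
A simple loss-threshold membership signal for MLMs can be sketched as follows (a generic baseline assuming a HuggingFace BERT-style checkpoint and an arbitrary threshold; the attacks studied in this work, such as likelihood-ratio variants, are more refined):

```python
# Generic sketch: lower masked-token loss (pseudo-perplexity) is treated as
# weak evidence that a sequence was part of the MLM's training data.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # placeholder MLM checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

def pseudo_nll(text: str) -> float:
    """Average negative log-likelihood, masking one token position at a time."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    losses = []
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        labels = torch.full_like(ids, -100)  # -100 = ignore position in the loss
        labels[i] = ids[i]
        with torch.no_grad():
            out = mlm(input_ids=masked.unsqueeze(0), labels=labels.unsqueeze(0))
        losses.append(out.loss.item())
    return sum(losses) / max(len(losses), 1)

THRESHOLD = 3.0  # in practice calibrated on known non-member data
sample = "Patient was prescribed 40 mg of atorvastatin."  # hypothetical candidate
print("member" if pseudo_nll(sample) < THRESHOLD else "non-member")
```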

Downstream task performance of BERT models pre-trained using automatically de-identified clinical data

T Vakili, A Lamproudis, A Henriksson… - Proceedings of the …, 2022 - aclanthology.org
Automatic de-identification is a cost-effective and straightforward way of removing large
amounts of personally identifiable information from large and sensitive corpora. However …
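
As a toy illustration of what automatic de-identification does (a minimal regex-based sketch with a few assumed PII patterns; systems of the kind studied here typically rely on trained NER taggers rather than hand-written rules):

```python
# Generic sketch: replace matched PII spans with category placeholders.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def deidentify(text: str) -> str:
    """Substitute every matched PII span with a [CATEGORY] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(deidentify("Reach the patient at 555-123-4567 or jane.roe@clinic.org"))
# -> "Reach the patient at [PHONE] or [EMAIL]"
```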

Do not give away my secrets: Uncovering the privacy issue of neural code completion tools

Y Huang, Y Li, W Wu, J Zhang, MR Lyu - arXiv preprint arXiv:2309.07639, 2023 - arxiv.org
Neural Code Completion Tools (NCCTs) have reshaped the field of software development by accurately suggesting contextually relevant code snippets, benefiting from language …

Disclosure control of machine learning models from trusted research environments (TRE): New challenges and opportunities

E Mansouri-Benssassi, S Rogers, S Reel, M Malone… - Heliyon, 2023 - cell.com
Artificial intelligence (AI) applications in healthcare and medicine have increased in recent years. To enable access to personal data, Trusted Research …

Digger: Detecting copyright content mis-usage in large language model training

H Li, G Deng, Y Liu, K Wang, Y Li, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of
Large Language Models (LLMs) across numerous applications. However, the detailed …

Membership inference attacks with token-level deduplication on Korean language models

MG Oh, LH Park, J Kim, J Park, T Kwon - IEEE Access, 2023 - ieeexplore.ieee.org
The confidentiality threat against training data has become a significant security problem in
neural language models. Recent studies have shown that memorized training data can be …
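
Deduplication of training text is a common memorization mitigation, and a token-level variant can be sketched roughly as below (a generic sketch that drops sequences whose token n-grams were already seen; how deduplication is actually used inside the paper's attack may differ):

```python
# Generic sketch of token-level n-gram deduplication over tokenized sequences.
def dedup_by_ngrams(sequences: list[list[int]], n: int = 8) -> list[list[int]]:
    """Keep only sequences that share no token n-gram with previously kept ones."""
    seen: set[tuple[int, ...]] = set()
    kept = []
    for seq in sequences:
        grams = {tuple(seq[i:i + n]) for i in range(max(len(seq) - n + 1, 0))}
        if grams.isdisjoint(seen):
            kept.append(seq)
            seen |= grams
    return kept

corpus = [[1, 2, 3, 4, 5, 6, 7, 8, 9],
          [1, 2, 3, 4, 5, 6, 7, 8, 9],   # exact duplicate, dropped
          [9, 8, 7, 6, 5, 4, 3, 2, 1]]
print(len(dedup_by_ngrams(corpus)))  # 2
```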

End-to-end pseudonymization of fine-tuned clinical BERT models: Privacy preservation with maintained data utility

T Vakili, A Henriksson, H Dalianis - BMC Medical Informatics and Decision …, 2024 - Springer
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained
language models (PLMs). These models consist of large numbers of parameters that are …
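
The core idea of pseudonymization, replacing detected identifiers with realistic surrogates rather than deleting them, can be sketched as follows (assumed surrogate list and a pre-detected entity list, e.g. from an NER step; not the paper's end-to-end pipeline):

```python
# Generic sketch: map every occurrence of the same entity to the same surrogate.
import hashlib

SURROGATE_NAMES = ["Alex Smith", "Sam Jones", "Kim Lee", "Pat Brown"]  # assumed list

def surrogate_for(entity: str) -> str:
    """Deterministically pick a surrogate so repeated mentions stay consistent."""
    digest = int(hashlib.sha256(entity.encode()).hexdigest(), 16)
    return SURROGATE_NAMES[digest % len(SURROGATE_NAMES)]

def pseudonymize(text: str, entities: list[str]) -> str:
    """Replace each detected entity string with its surrogate."""
    for ent in entities:
        text = text.replace(ent, surrogate_for(ent))
    return text

print(pseudonymize("John Doe was admitted; John Doe's labs were normal.", ["John Doe"]))
```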

Using membership inference attacks to evaluate privacy-preserving language modeling fails for pseudonymizing data

T Vakili, H Dalianis - Proceedings of the 24th Nordic Conference …, 2023 - aclanthology.org
Large pre-trained language models dominate the current state-of-the-art for many natural
language processing applications, including the field of clinical NLP. Several studies have …