Whose text is it anyway? Exploring BigCode, intellectual property, and ethics

MZ Choksi, D Goedicke - arXiv preprint arXiv:2304.02839, 2023 - arxiv.org
Intelligent or generative writing tools rely on large language models that recognize,
summarize, translate, and predict content. This position paper probes the copyright interests …

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

X Liu, T Sun, T Xu, F Wu, C Wang, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have transformed machine learning but raised significant
legal concerns due to their potential to produce text that infringes on copyrights, resulting in …

Do language models plagiarize?

J Lee, T Le, J Chen, D Lee - Proceedings of the ACM Web Conference …, 2023 - dl.acm.org
Past literature has illustrated that language models (LMs) often memorize parts of training
instances and reproduce them in natural language generation (NLG) processes. However, it …

Through the looking glass: Learning to attribute synthetic text generated by language models

S Munir, B Batool, Z Shafiq, P Srinivasan… - Proceedings of the …, 2021 - aclanthology.org
Given the potential misuse of recent advances in synthetic text generation by language
models (LMs), it is important to have the capacity to attribute authorship of synthetic text …

The (ab)use of open source code to train large language models

A Al-Kaswan, M Izadi - 2023 IEEE/ACM 2nd International …, 2023 - ieeexplore.ieee.org
In recent years, Large Language Models (LLMs) have gained significant popularity due to
their ability to generate human-like text and their potential applications in various fields, such …

Digger: Detecting copyright content mis-usage in large language model training

H Li, G Deng, Y Liu, K Wang, Y Li, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of
Large Language Models (LLMs) across numerous applications. However, the detailed …

Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models

S Zhao, L Zhu, R Quan, Y Yang - arXiv preprint arXiv:2403.15740, 2024 - arxiv.org
Web user data plays a central role in the ecosystem of pre-trained large language models
(LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to …

Large language model applications for evaluation: Opportunities and ethical implications

CB Head, P Jasper, M McConnachie… - New directions for …, 2023 - Wiley Online Library
Large language models (LLMs) are a type of generative artificial intelligence (AI) designed
to produce text‐based content. LLMs use deep learning techniques and massively large …

Matching pairs: Attributing fine-tuned models to their pre-trained large language models

M Foley, A Rawat, T Lee, Y Hou, G Picco… - arXiv preprint arXiv …, 2023 - arxiv.org
The wide applicability and adaptability of generative large language models (LLMs) has
enabled their rapid adoption. While the pre-trained models can perform many tasks, such …