How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library

M Ravaut, B Ding, F Jiao, H Chen, X Li, R Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of Large Language Models (LLMs) in recent years, new opportunities are
emerging, but also new challenges, and contamination is quickly becoming critical …

Large Language Model for Vulnerability Detection and Repair: Literature Review and Roadmap

X Zhou, S Cao, X Sun, D Lo - arXiv preprint arXiv:2404.02525, 2024 - arxiv.org
The significant advancements in Large Language Models (LLMs) have resulted in their
widespread adoption across various tasks within Software Engineering (SE), including …

Instructional fingerprinting of large language models

J Xu, F Wang, MD Ma, PW Koh, C Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
The exorbitant cost of training Large language models (LLMs) from scratch makes it
essential to fingerprint the models to protect intellectual property via ownership …

On catastrophic inheritance of large foundation models

H Chen, B Raj, X Xie, J Wang - arXiv preprint arXiv:2402.01909, 2024 - arxiv.org
Large foundation models (LFMs) are claiming incredible performances. Yet great concerns
have been raised about their mythic and uninterpreted potentials not only in machine …

If in a Crowdsourced Data Annotation Pipeline, a GPT-4

Z He, CY Huang, CKC Ding, S Rohatgi… - Proceedings of the CHI …, 2024 - dl.acm.org
Recent studies indicated GPT-4 outperforms online crowd workers in data labeling
accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies …

Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks

L Ibrahim, S Huang, L Ahmad, M Anderljung - arXiv preprint arXiv …, 2024 - arxiv.org
Model evaluations are central to understanding the safety, risks, and societal impacts of AI
systems. While most real-world AI applications involve human-AI interaction, most current …

Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

JQ Zhu, H Yan, TL Griffiths - arXiv preprint arXiv:2405.19313, 2024 - arxiv.org
The observed similarities in the behavior of humans and Large Language Models (LLMs)
have prompted researchers to consider the potential of using LLMs as models of human …

DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Z Zhang, J Chen, D Yang - arXiv preprint arXiv:2406.17271, 2024 - arxiv.org
The current paradigm of evaluating Large Language Models (LLMs) through static
benchmarks comes with significant limitations, such as vulnerability to data contamination …

LMD3: Language Model Data Density Dependence

J Kirchenbauer, G Honke, G Somepalli… - arXiv preprint arXiv …, 2024 - arxiv.org
We develop a methodology for analyzing language model task performance at the individual
example level based on training data density estimation. Experiments with paraphrasing as …

Questionable practices in machine learning

G Leech, JJ Vazquez, M Yagudin, N Kupper… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating modern ML models is hard. The strong incentive for researchers and companies
to report a state-of-the-art result on some metric often leads to questionable research …