A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Data augmentation using LLMs: Data perspectives, learning paradigms and challenges

B Ding, C Qin, R Zhao, T Luo, X Li… - Findings of the …, 2024 - aclanthology.org
In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has
emerged as a pivotal technique for enhancing model performance by diversifying training …

Don't make your LLM an evaluation benchmark cheater

K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models~(LLMs) have greatly advanced the frontiers of artificial intelligence,
attaining remarkable improvement in model capacity. To assess the model performance, a …

How much are LLMs contaminated? A comprehensive survey and the LLMSanitize library

M Ravaut, B Ding, F Jiao, H Chen, X Li, R Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of Large Language Models (LLMs) in recent years, new opportunities are
emerging, but also new challenges, and contamination is quickly becoming critical …

SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling

D Kim, C Park, S Kim, W Lee, W Song, Y Kim… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters,
demonstrating superior performance in various natural language processing (NLP) tasks …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

Investigating data contamination in modern benchmarks for large language models

C Deng, Y Zhao, X Tang, M Gerstein… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent observations have underscored a disparity between the inflated benchmark scores
and the actual performance of LLMs, raising concerns about potential contamination of …

One thousand and one pairs: A "novel" challenge for long-context language models

M Karpinska, K Thai, K Lo, T Goyal, M Iyyer - arXiv preprint arXiv …, 2024 - arxiv.org
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and …

The information retrieval experiment platform

M Fröbe, JH Reimer, S MacAvaney, N Deckers… - Proceedings of the 46th …, 2023 - dl.acm.org
We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval
Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and …

Same task, more tokens: the impact of input length on the reasoning performance of large language models

M Levy, A Jacoby, Y Goldberg - arXiv preprint arXiv:2402.14848, 2024 - arxiv.org
This paper explores the impact of extending input lengths on the capabilities of Large
Language Models (LLMs). Despite LLMs advancements in recent times, their performance …