A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Data augmentation using LLMs: Data perspectives, learning paradigms and challenges

B Ding, C Qin, R Zhao, T Luo, X Li… - Findings of the …, 2024 - aclanthology.org
In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has
emerged as a pivotal technique for enhancing model performance by diversifying training …

Don't make your LLM an evaluation benchmark cheater

K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models~(LLMs) have greatly advanced the frontiers of artificial intelligence,
attaining remarkable improvement in model capacity. To assess the model performance, a …

How much are LLMs contaminated? A comprehensive survey and the LLMSanitize library

M Ravaut, B Ding, F Jiao, H Chen, X Li, R Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of Large Language Models (LLMs) in recent years, new opportunities are
emerging, but also new challenges, and contamination is quickly becoming critical …

SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling

D Kim, C Park, S Kim, W Lee, W Song, Y Kim… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters,
demonstrating superior performance in various natural language processing (NLP) tasks …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

Investigating data contamination in modern benchmarks for large language models

C Deng, Y Zhao, X Tang, M Gerstein… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent observations have underscored a disparity between the inflated benchmark scores
and the actual performance of LLMs, raising concerns about potential contamination of …

One thousand and one pairs: A "novel" challenge for long-context language models

M Karpinska, K Thai, K Lo, T Goyal, M Iyyer - arXiv preprint arXiv …, 2024 - arxiv.org
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and …

The information retrieval experiment platform

M Fröbe, JH Reimer, S MacAvaney, N Deckers… - Proceedings of the 46th …, 2023 - dl.acm.org
We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval
Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and …

Same task, more tokens: the impact of input length on the reasoning performance of large language models

M Levy, A Jacoby, Y Goldberg - arXiv preprint arXiv:2402.14848, 2024 - arxiv.org
This paper explores the impact of extending input lengths on the capabilities of Large
Language Models (LLMs). Despite LLMs advancements in recent times, their performance …