A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Qwen technical report

J Bai, S Bai, Y Chu, Z Cui, K Dang, X Deng… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have revolutionized the field of artificial intelligence,
enabling natural language processing tasks that were previously thought to be exclusive to …

WizardCoder: Empowering code large language models with Evol-Instruct

Z Luo, C Xu, P Zhao, Q Sun, X Geng, W Hu… - arXiv preprint arXiv …, 2023 - arxiv.org
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated
exceptional performance in code-related tasks. However, most existing models are solely …

Scaling data-constrained language models

N Muennighoff, A Rush, B Barak… - Advances in …, 2024 - proceedings.neurips.cc
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …

C-Pack: Packaged resources to advance general Chinese embedding

S Xiao, Z Liu, P Zhang, N Muennighoff, D Lian… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce C-Pack, a package of resources that significantly advance the field of general
Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

StarCoder 2 and The Stack v2: The next generation

A Lozhkov, R Li, LB Allal, F Cassano… - arXiv preprint arXiv …, 2024 - arxiv.org
The BigCode project, an open-scientific collaboration focused on the responsible
development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In …

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended)

R Sun, SÖ Arik, A Muzio, L Miculicich… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-SQL, the process of translating natural language into Structured Query Language
(SQL), represents a transformative application of large language models (LLMs), potentially …

A survey of large language models for code: Evolution, benchmarking, and future trends

Z Zheng, K Ning, Y Wang, J Zhang, D Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
General large language models (LLMs), represented by ChatGPT, have demonstrated
significant potential in tasks such as code generation in software engineering. This has led …

The Data Provenance Initiative: A large scale audit of dataset licensing & attribution in AI

S Longpre, R Mahari, A Chen, N Obeng-Marnu… - arXiv preprint arXiv …, 2023 - arxiv.org
The race to train language models on vast, diverse, and inconsistently documented datasets
has raised pressing concerns about the legal and ethical risks for practitioners. To remedy …