Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

S Goyal, P Maini, ZC Lipton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully
selected subsets of massive web scrapes. For instance, the LAION public dataset retained …
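
A minimal toy sketch of the core idea suggested by the title, not the paper's actual scaling laws: whether aggressive quality filtering pays off depends on how many times the retained subset would have to be repeated for a given compute budget. All names, constants, and the quality/penalty model below are assumptions for illustration only.

# Toy illustration (not the paper's method): choose what fraction of a data
# pool to retain as a function of the compute budget, penalizing subsets
# that must be repeated for many epochs.
def choose_retention_fraction(compute_tokens, pool_tokens, candidate_fractions=(0.1, 0.3, 0.5, 1.0)):
    def repeat_penalty(epochs):
        # Assumed decay in the value of repeated data; purely illustrative.
        return 1.0 / (1.0 + 0.2 * max(epochs - 1.0, 0.0))

    best = None
    for frac in candidate_fractions:
        subset_tokens = frac * pool_tokens
        epochs = compute_tokens / subset_tokens
        avg_quality = 1.0 - 0.5 * frac  # assume quality drops as we keep more data
        utility = avg_quality * repeat_penalty(epochs)
        if best is None or utility > best[1]:
            best = (frac, utility)
    return best[0]

# At a large compute budget relative to the pool, the toy model prefers keeping more data.
print(choose_retention_fraction(compute_tokens=1e12, pool_tokens=2e11))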

CinePile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, R Basri, D Jacobs… - arXiv preprint arXiv …, 2024 - arxiv.org
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …

Zamba: A Compact 7B SSM Hybrid Model

P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …

Reverse training to nurse the reversal curse

O Golovneva, Z Allen-Zhu, J Weston… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have a surprising failure: when trained on "A has a feature
B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even …

Scaling Synthetic Data Creation with 1,000,000,000 Personas

X Chan, X Wang, D Yu, H Mi, D Yu - arXiv preprint arXiv:2406.20094, 2024 - arxiv.org
We propose a novel persona-driven data synthesis methodology that leverages various
perspectives within a large language model (LLM) to create diverse synthetic data. To fully …
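
A hedged sketch of persona-driven data synthesis as described in the snippet: pairing each generation prompt with a distinct persona so a single LLM produces diverse outputs. The persona list and the `generate` callable are placeholders, not artifacts from the paper.

# Placeholder personas; Persona Hub itself contains one billion of them.
personas = [
    "a high-school physics teacher",
    "a freight logistics planner",
    "a competitive chess commentator",
]

def synthesize(task: str, generate):
    # `generate` stands in for any LLM call.
    samples = []
    for persona in personas:
        prompt = f"You are {persona}. {task}"
        samples.append({"persona": persona, "prompt": prompt, "response": generate(prompt)})
    return samples

# Dummy generator standing in for a real model call.
print(synthesize("Write one math word problem.", generate=lambda p: f"[model output for: {p}]"))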

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training

M Hirano, K Imajo - arXiv preprint arXiv:2404.10555, 2024 - arxiv.org
Large language models (LLMs) are now widely used in various fields, including finance.
However, Japanese financial-specific LLMs have not been proposed yet. Hence, this study …
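
A minimal sketch of continual pre-training in general, assuming a Hugging Face causal LM: start from an existing general-purpose checkpoint and keep optimizing the same language-modeling objective on a domain corpus (here, Japanese financial text). Model names, batching, and hyperparameters are placeholders, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def continual_pretrain(base_model_name, domain_batches, lr=1e-5, device="cpu"):
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # common workaround for GPT-style tokenizers
    model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for texts in domain_batches:  # an iterable of lists of domain documents
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
        # Simplification: padded positions are not masked out of the labels here.
        loss = model(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model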

Large Language Model-guided Document Selection

X Kong, T Gunter, R Pang - arXiv preprint arXiv:2406.04638, 2024 - arxiv.org
Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet
recent research has demonstrated that careful document selection enables comparable …
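
A hedged sketch of LLM-guided document selection in the spirit of the snippet: ask a language model to rate each candidate document and keep only the highest-rated ones. The prompt wording, rating scale, and `ask_llm` callable are assumptions; the paper's actual pipeline may differ.

def select_documents(docs, ask_llm, keep_fraction=0.5):
    scored = []
    for doc in docs:
        prompt = (
            "Rate the educational value of the following text from 1 (low) to 5 (high). "
            "Answer with a single digit.\n\n" + doc[:2000]
        )
        try:
            score = int(ask_llm(prompt).strip()[0])
        except (ValueError, IndexError):
            score = 1  # treat unparseable answers as low quality
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [doc for _, doc in scored[:keep]]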

Towards Effective and Efficient Continual Pre-training of Large Language Models

J Chen, Z Chen, J Wang, K Zhou, Y Zhu, J Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Continual pre-training (CPT) has been an important approach for adapting language models
to specific domains or tasks. To make the CPT approach more traceable, this paper presents …
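
One ingredient commonly studied in CPT work like this is mixing replayed general-domain data into the new domain stream to limit forgetting; the sketch below illustrates that idea only, with an assumed replay ratio rather than the paper's exact recipe.

import random

def mixed_cpt_stream(domain_docs, general_docs, replay_ratio=0.2, seed=0):
    rng = random.Random(seed)
    for domain_doc in domain_docs:
        yield domain_doc
        # With probability `replay_ratio`, interleave a replayed general-domain document.
        if rng.random() < replay_ratio:
            yield rng.choice(general_docs)

stream = mixed_cpt_stream(["domain text A", "domain text B"], ["general text X"], replay_ratio=0.5)
print(list(stream))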

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

F Kang, Y Sun, B Wen, S Chen, D Song… - arXiv preprint arXiv …, 2024 - arxiv.org
To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data
mixtures over different domains. In this work, we demonstrate that the optimal data …
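
A hedged sketch of how a predicted data mixture would be consumed downstream: sample pretraining documents from several domains in proportion to a weight vector. How AutoScale actually predicts compute-optimal weights is not reproduced here; the pools and weights below are placeholders.

import random

def sample_mixture(domain_pools, weights, n_samples, seed=0):
    rng = random.Random(seed)
    domains = list(domain_pools)
    probs = [weights[d] for d in domains]
    batch = []
    for _ in range(n_samples):
        (domain,) = rng.choices(domains, weights=probs, k=1)
        batch.append((domain, rng.choice(domain_pools[domain])))
    return batch

pools = {"web": ["w1", "w2"], "code": ["c1"], "books": ["b1", "b2", "b3"]}
print(sample_mixture(pools, {"web": 0.6, "code": 0.1, "books": 0.3}, n_samples=5))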

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Z Yu, S Das, C Xiong - arXiv preprint arXiv:2406.06046, 2024 - arxiv.org
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
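
A minimal sketch of model-aware data selection in the sense gestured at here: an auxiliary scorer ranks candidate pretraining examples by estimated usefulness to the current model, and only the top-scoring ones are kept for the next training stage. The scorer below is a trivial stand-in, not MATES's data influence model, which is itself trained and periodically refreshed.

def select_top_k(candidates, score_fn, k):
    # Keep the k examples the scorer considers most useful.
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Toy usage: pretend longer documents are more "influential".
docs = ["short", "a somewhat longer document", "the longest candidate document here"]
print(select_top_k(docs, score_fn=len, k=2))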