Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora

A Warstadt, A Mueller, L Choshen… - … of the BabyLM …, 2023 - research-collection.ethz.ch
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …

Is Child-Directed Speech Effective Training Data for Language Models?

SY Feng, ND Goodman, MC Frank - arXiv preprint arXiv:2408.03617, 2024 - arxiv.org
While high-performing language models are typically trained on hundreds of billions of
words, human children become fluent language users with a much smaller amount of data …

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Z Yu, S Das, C Xiong - arXiv preprint arXiv:2406.06046, 2024 - arxiv.org
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
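The MATES abstract above points at a general recipe: use a small, model-aware influence predictor to score candidate pretraining documents and keep only the highest-scoring fraction. The sketch below is a hypothetical illustration of that generic pattern, not the MATES implementation; the names `Document`, `probe_influence`, and `select_top_fraction` are invented here, and the toy length-based scorer stands in for a trained data influence model.

```python
"""Hypothetical sketch: model-aware data selection for pretraining.

Not the MATES method itself; it only illustrates the generic idea of
ranking candidate documents by a predicted influence score and keeping
the top fraction. All names below are assumptions for illustration.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    doc_id: str
    text: str


def select_top_fraction(
    docs: List[Document],
    probe_influence: Callable[[Document], float],
    keep_fraction: float = 0.3,
) -> List[Document]:
    """Rank documents by predicted influence and keep the best slice.

    probe_influence stands in for a small, cheap model that estimates
    how much a document would reduce the target model's held-out loss.
    """
    scored = sorted(docs, key=probe_influence, reverse=True)
    keep_n = max(1, int(len(scored) * keep_fraction))
    return scored[:keep_n]


if __name__ == "__main__":
    corpus = [Document(f"doc{i}", f"example text {i}") for i in range(10)]
    # Toy influence proxy (placeholder only): longer documents score higher.
    chosen = select_top_fraction(corpus, lambda d: len(d.text), keep_fraction=0.3)
    print([d.doc_id for d in chosen])
```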

A surprisal oracle for when every layer counts

X Hong, S Loáiciga, A Sayeed - arXiv preprint arXiv:2412.03098, 2024 - arxiv.org
Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner-directed
approach to training a language model. We proposed the original version of this process in …
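The snippet names the learner-directed idea but not the procedure, so the following is only a hedged sketch of one way a surprisal oracle could steer a curriculum: score the remaining training examples with the current model's surprisal (negative log-likelihood) and reorder them before the next pass. Whether low- or high-surprisal items come first is left as a knob, since the excerpt does not say; the function names are assumptions, not the authors' code.

```python
"""Hypothetical sketch: surprisal-driven ordering of training examples.

Not the ACLM procedure of Hong et al.; it only illustrates re-ranking
training data by the current model's surprisal before each pass.
"""
import math
from typing import Callable, List, Sequence


def surprisal(token_probs: Sequence[float]) -> float:
    """Total surprisal (negative log-likelihood) given per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def order_by_surprisal(
    examples: List[str],
    score: Callable[[str], float],
    most_surprising_first: bool = True,
) -> List[str]:
    """Order examples by model surprisal; the direction is a design choice."""
    return sorted(examples, key=score, reverse=most_surprising_first)


if __name__ == "__main__":
    # Toy scorer: a uniform 0.5 per-token probability, so surprisal grows
    # with sentence length. A real oracle would query the language model.
    toy_examples = ["the cat sat", "a somewhat longer training sentence",
                    "hi", "an even longer and rarer training sentence here"]
    toy_score = lambda s: surprisal([0.5] * len(s.split()))
    for ex in order_by_surprisal(toy_examples, score=toy_score):
        print(ex)
```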

Automatic Quality Estimation for Data Selection and Curriculum Learning

H Nguyen, L Yip, J DeBenedetto - csc.villanova.edu
The size of neural models within natural language processing has increased at a rapid pace
in recent years. With this increase in model size comes an increase in the amount of training …
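The excerpt above is truncated, so the sketch below illustrates only the generic pattern its title names: assign each training example an automatic quality score, drop low-quality examples, and present the remainder from highest to lowest quality. The character-level heuristic and the threshold are placeholders invented here, not the authors' estimator.

```python
"""Hypothetical sketch: quality-based data selection plus curriculum ordering.

The scoring heuristic and threshold are placeholders for a trained
quality estimator; this is not the method of Nguyen et al.
"""
from typing import List, Tuple


def quality_score(text: str) -> float:
    """Placeholder quality estimate: fraction of alphabetic/space characters."""
    if not text:
        return 0.0
    clean = sum(ch.isalpha() or ch.isspace() for ch in text)
    return clean / len(text)


def build_curriculum(
    examples: List[str],
    min_quality: float = 0.8,
) -> List[Tuple[float, str]]:
    """Filter out low-quality examples, then order from highest to lowest score."""
    scored = [(quality_score(x), x) for x in examples]
    kept = [pair for pair in scored if pair[0] >= min_quality]
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    data = ["clean training sentence", "n0!sy ### junk 123///", "another clean line"]
    for score, text in build_curriculum(data):
        print(f"{score:.2f}  {text}")
```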