While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data …
Z Yu, S Das, C Xiong - arXiv preprint arXiv:2406.06046, 2024 - arxiv.org
Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection …
Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner-directed approach to training a language model. We proposed the original version of this process in …
H Nguyen, L Yip, J DeBenedetto - csc.villanova.edu
The size of neural models within natural language processing has increased at a rapid pace in recent years. With this increase in model size comes an increase in the amount of training …