Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a …
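The objective contrasted in the snippet above is easy to state concretely. Below is a minimal PyTorch-style sketch of a per-token reweighted next-token loss; the `token_weights` mask is a hypothetical stand-in for whatever selection criterion a given method uses, and uniform weights recover the standard objective:

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, targets, token_weights):
    """Next-token prediction loss with per-token reweighting.

    logits:        (batch, seq, vocab) model outputs
    targets:       (batch, seq) next-token ids
    token_weights: (batch, seq) hard 0/1 mask or soft scores selecting
                   which tokens contribute to the loss; all-ones
                   recovers the usual uniform objective
    """
    # Unreduced cross-entropy gives one loss value per token position.
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets)
    # Weighted mean over the selected tokens only.
    return (per_token * token_weights).sum() / token_weights.sum().clamp(min=1e-8)
```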
Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding …
S Sartor, N Thompson - arXiv preprint arXiv:2405.14005, 2024 - arxiv.org
Scaling laws have driven remarkable progress across machine learning domains like language modeling and computer vision. However, the exploration of scaling laws in …
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different …
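The disagreement is easiest to see in the fitted exponents. Both works model the compute-optimal parameter count as a power law in the compute budget C (proportionality constants omitted; exponents as reported in the respective papers):

```latex
% Compute-optimal model size as a power law in the compute budget C
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
a \approx 0.73 \ \text{(Kaplan et al., 2020)}, \qquad
a \approx 0.50 \ \text{(Hoffmann et al., 2022)}
```

Under the Hoffmann et al. fit, parameters and training tokens are scaled in roughly equal proportion, so at large compute budgets the two prescriptions diverge by orders of magnitude in recommended model size.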
Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model …
Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively …
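A back-of-the-envelope decomposition suggests why vocabulary size cannot be ignored (a rough illustration only, not the paper's fitted law): for hidden width d and vocabulary size V, the input embedding alone contributes on the order of V times d parameters on top of the non-embedding "body" of the model:

```latex
% Rough Transformer parameter decomposition with vocabulary size V
% and hidden width d (the V*d term doubles if the output head is untied)
N_{\text{total}} \approx N_{\text{body}} + V \cdot d
```

For a small model with d = 768 and a 250k-token vocabulary, the embedding matrix alone is roughly 192M parameters, which can dominate the total count.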
Increasing the size of a Transformer model does not always lead to enhanced performance. This phenomenon cannot be explained by existing empirical scaling laws. Furthermore …
N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks--which hopefully measure, with some validity, the presence of capabilities that …
In this work, we consider whether pretraining on a pruned high-quality subset of a large-scale text dataset can improve LLM performance. While existing work has shown that …
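The general recipe behind this kind of pruning is simple to sketch. In the minimal example below, `quality_score` is a hypothetical stand-in for whatever scorer a given method uses, for instance negative perplexity under a small reference model:

```python
def prune_corpus(documents, quality_score, keep_fraction=0.5):
    """Keep only the highest-quality fraction of a text corpus.

    documents:     list of raw text documents
    quality_score: callable mapping a document to a float, higher is
                   better (hypothetical; e.g. negative perplexity under
                   a small reference model)
    keep_fraction: fraction of the corpus retained for pretraining
    """
    scored = sorted(documents, key=quality_score, reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Usage: pretrain on prune_corpus(corpus, score_fn, keep_fraction=0.3)
```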