The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Rho-1: Not all tokens are what you need

Z Lin, Z Gou, Y Gong, X Liu, Y Shen, R Xu, C Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a …
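
As a rough illustration of the selective-loss idea this snippet points at, the sketch below averages the next-token loss only over tokens selected by comparing against a small reference model's per-token loss. The selection rule, ratio, and names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code) of selective token-level loss:
# rather than averaging next-token cross-entropy over every position, keep only
# the tokens whose loss most exceeds a small reference model's loss and average
# the training loss over that subset.
import numpy as np

def selective_lm_loss(train_losses, ref_losses, keep_ratio=0.6):
    """train_losses, ref_losses: per-token cross-entropy arrays of equal length.
    Keeps the keep_ratio fraction of tokens with the largest excess loss
    (training loss minus reference loss) and averages training loss over them."""
    excess = train_losses - ref_losses
    k = max(1, int(keep_ratio * len(train_losses)))
    keep_idx = np.argsort(excess)[-k:]   # tokens the model has most left to learn
    return train_losses[keep_idx].mean()

# Toy usage with made-up per-token losses for a 5-token sequence.
train = np.array([2.1, 0.3, 4.0, 1.2, 3.5])
ref   = np.array([2.0, 0.2, 1.0, 1.1, 1.0])
print(selective_lm_loss(train, ref, keep_ratio=0.4))   # averages the 2 hardest tokens
```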

Observational Scaling Laws and the Predictability of Language Model Performance

Y Ruan, CJ Maddison, T Hashimoto - arXiv preprint arXiv:2405.10938, 2024 - arxiv.org
Understanding how language model performance varies with scale is critical to benchmark
and algorithm development. Scaling laws are one approach to building this understanding …
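
For context, a conventional scaling-law fit looks like the sketch below: a saturating power law fitted to (compute, loss) pairs and then extrapolated. The functional form, constants, and data points are illustrative assumptions, not the observational method the paper proposes.

```python
# Illustrative sketch of a conventional scaling-law fit: L(C) = a * C^(-b) + c
# fitted to made-up (compute, loss) points and used to extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(compute, a, b, c):
    return a * compute ** (-b) + c

compute = np.array([1.0, 10.0, 100.0, 1000.0])   # compute in units of 1e19 FLOPs
loss    = np.array([3.2, 2.8, 2.5, 2.3])          # toy eval losses

params, _ = curve_fit(scaling_curve, compute, loss, p0=[1.0, 0.1, 2.0], maxfev=10000)
a, b, c = params
print(f"fitted: L(C) = {a:.2f} * C^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at 1e23 FLOPs:", scaling_curve(10000.0, *params))
```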

Neural Scaling Laws for Embodied AI

S Sartor, N Thompson - arXiv preprint arXiv:2405.14005, 2024 - arxiv.org
Scaling laws have driven remarkable progress across machine learning domains like
language modeling and computer vision. However, the exploration of scaling laws in …

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
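
To make the disagreement concrete, the sketch below compares the approximate published exponents (roughly N_opt ∝ C^0.73 for Kaplan et al. versus roughly N_opt ∝ C^0.5 for Hoffmann et al.). The common anchor point is an arbitrary illustrative choice, not a value from either paper.

```python
# Rough numeric illustration of why the two laws diverge: with different
# exponents for optimal model size as a function of compute, the predictions
# drift apart as compute grows. Exponents are the approximate published values
# (~0.73 Kaplan et al., ~0.5 Hoffmann et al.); the anchor at 1e21 FLOPs is
# an arbitrary illustrative choice.
def optimal_params(compute_flops, exponent, anchor_params=1.5e9, anchor_flops=1e21):
    return anchor_params * (compute_flops / anchor_flops) ** exponent

for c in (1e21, 1e23, 1e25):
    kaplan   = optimal_params(c, exponent=0.73)
    hoffmann = optimal_params(c, exponent=0.50)
    print(f"C={c:.0e} FLOPs  Kaplan≈{kaplan:.2e}  Hoffmann≈{hoffmann:.2e} params")
```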

ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

M Piau, R Lotufo, R Nogueira - arXiv preprint arXiv:2406.10806, 2024 - arxiv.org
Despite advancements in Natural Language Processing (NLP) and the growing availability
of pretrained models, the English language remains the primary focus of model …

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

C Tao, Q Liu, L Dou, N Muennighoff, Z Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Research on scaling large language models (LLMs) has primarily focused on model
parameters and training data size, overlooking the role of vocabulary size. Intuitively …
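
One reason vocabulary size is hard to ignore, illustrated below: the embedding and unembedding matrices grow with V × d_model, so at modest model sizes a large vocabulary can claim a sizable share of the parameter budget. The function and numbers are illustrative assumptions, not the paper's scaling law.

```python
# Illustrative only: share of parameters taken by the embedding and unembedding
# matrices as vocabulary size V grows, assuming untied input/output embeddings
# and a fixed non-embedding parameter count.
def embedding_fraction(non_embedding_params, vocab_size, d_model, tied=False):
    embed = vocab_size * d_model * (1 if tied else 2)
    return embed / (embed + non_embedding_params)

for v in (32_000, 128_000, 256_000):
    frac = embedding_fraction(non_embedding_params=100_000_000,
                              vocab_size=v, d_model=768)
    print(f"V={v:>7}: embeddings ≈ {frac:.0%} of total parameters")
```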

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

X Niu, B Bai, L Deng, W Han - arXiv preprint arXiv:2405.08707, 2024 - arxiv.org
Increasing the size of a Transformer model does not always lead to enhanced performance.
This phenomenon cannot be explained by the empirical scaling laws. Furthermore …

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
What makes a good Large Language Model (LLM)? That it performs well on the relevant
benchmarks--which hopefully measure, with some validity, the presence of capabilities that …

Perplexed by Perplexity: Perplexity-Based Pruning with Small Reference Models

Z Ankner, C Blakeney, K Sreenivasan, M Marion… - ICLR 2024 Workshop on … - openreview.net
In this work, we consider whether pretraining on a pruned high-quality subset of a large-
scale text dataset can improve LLM performance. While existing work has shown that …
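
A minimal sketch of the general recipe the title describes: score each document's perplexity with a small reference model and keep a subset of the corpus. The keep-lowest criterion and helper names are simplifying assumptions rather than the paper's specific selection rule.

```python
# Minimal sketch (assumed, not the paper's pipeline) of perplexity-based pruning:
# score each document with a small reference language model's per-token
# log-probabilities, then keep a fraction of the corpus by perplexity.
import math

def reference_perplexity(token_logprobs):
    """Perplexity from per-token log-probs produced by a small reference LM."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def prune_by_perplexity(corpus_logprobs, keep_fraction=0.5):
    """Return indices of the keep_fraction of documents with lowest perplexity.
    The keep-lowest rule is a simplification; other perplexity bands could be kept."""
    ranked = sorted(range(len(corpus_logprobs)),
                    key=lambda i: reference_perplexity(corpus_logprobs[i]))
    n_keep = max(1, int(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Toy usage: three "documents" with made-up per-token log-probabilities.
corpus = [[-1.2, -0.8, -1.0], [-3.5, -4.0, -2.9], [-0.5, -0.7, -0.6]]
print(prune_by_perplexity(corpus, keep_fraction=0.7))   # e.g. [2, 0]
```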