Simulating 500 million years of evolution with a language model

T Hayes, R Rao, H Akin, NJ Sofroniew, D Oktay, Z Lin… - bioRxiv, 2024 - biorxiv.org
More than three billion years of evolution have produced an image of biology encoded into
the space of natural proteins. Here we show that language models trained on tokens …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Generative verifiers: Reward modeling as next-token prediction

L Zhang, A Hosseini, H Bansal, M Kazemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Verifiers or reward models are often used to enhance the reasoning performance of large
language models (LLMs). A common approach is the Best-of-N method, where N candidate …
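
A minimal sketch of the Best-of-N selection the snippet refers to, assuming hypothetical `generate_candidate` and `score_with_verifier` callables standing in for a sampled LLM completion and a learned verifier/reward score; this is not the paper's implementation.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidate: Callable[[str], str],
              score_with_verifier: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates: List[str] = [generate_candidate(prompt) for _ in range(n)]
    scores = [score_with_verifier(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```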

Minicpm: Unveiling the potential of small language models with scalable training strategies

S Hu, Y Tu, X Han, C He, G Cui, X Long… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion
parameters has been met with concerns regarding resource efficiency and practical …

Massive activations in large language models

M Sun, X Chen, JZ Kolter, Z Liu - arXiv preprint arXiv:2402.17762, 2024 - arxiv.org
We observe an empirical phenomenon in Large Language Models (LLMs): very few
activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call …
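
A minimal sketch of how one might flag such "massive activations" in a hidden-state tensor: compare each entry's magnitude to the median magnitude and report the extreme ratio. The threshold and the use of the median as the reference are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def find_massive_activations(hidden: torch.Tensor, ratio: float = 1000.0):
    """Return indices of entries whose |value| exceeds `ratio` times the median |value|,
    along with the max-to-median magnitude ratio observed in the tensor."""
    abs_vals = hidden.abs()
    median = abs_vals.median()
    mask = abs_vals > ratio * median
    return mask.nonzero(as_tuple=False), (abs_vals.max() / (median + 1e-12)).item()
```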

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …
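
A minimal sketch of the workflow the snippet alludes to: fit a saturating power law to cheap small-scale runs, then extrapolate to a large compute budget. The data points, functional form, and initial guess below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_compute(c_rel, a, b, irreducible):
    """Saturating power law in (relative) compute."""
    return a * c_rel ** (-b) + irreducible

flops = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # compute of the small runs (made up)
losses = np.array([3.9, 3.6, 3.3, 3.1, 2.9])       # measured validation losses (made up)

c_rel = flops / flops[0]                            # normalize for a well-conditioned fit
params, _ = curve_fit(loss_vs_compute, c_rel, losses, p0=[2.0, 0.3, 2.0])

print("predicted loss at 1e21 FLOPs:",
      loss_vs_compute(1e21 / flops[0], *params))    # extrapolate beyond the fitted range
```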

Scaling exponents across parameterizations and optimizers

K Everett, L Xiao, M Wortsman, AA Alemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robust and effective scaling of models from small to large width typically requires the
precise adjustment of many algorithmic and architectural details, such as parameterization …

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
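
For context, a minimal sketch of a Hoffmann-et-al.-style compute-optimal allocation using the common approximations C ≈ 6·N·D and roughly 20 training tokens per parameter; these are the widely quoted rules of thumb, not the constants this paper derives.

```python
def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C into parameters N and tokens D with C ~= 6*N*D and D ~= 20*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

print(compute_optimal_allocation(1e21))  # roughly 2.9e9 parameters and 5.8e10 tokens
```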

Disentangling the causes of plasticity loss in neural networks

C Lyle, Z Zheng, K Khetarpal, H van Hasselt… - arXiv preprint arXiv …, 2024 - arxiv.org
Underpinning the past decades of work on the design, initialization, and optimization of
neural networks is a seemingly innocuous assumption: that the network is trained on a …

Deconstructing what makes a good optimizer for language models

R Zhao, D Morwani, D Brandfonbrener, N Vyas… - arXiv preprint arXiv …, 2024 - arxiv.org
Training language models becomes increasingly expensive with scale, prompting numerous
attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer …
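
A minimal sketch of a single Adam update step, included only to make the optimizer named in the snippet concrete; the hyperparameters are the usual defaults, not settings studied in the paper.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (param, m, v) after one Adam step at timestep t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad                  # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2             # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                        # bias corrections
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```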