Simulating 500 million years of evolution with a language model

T Hayes, R Rao, H Akin, NJ Sofroniew, D Oktay, Z Lin… - bioRxiv, 2024 - biorxiv.org
More than three billion years of evolution have produced an image of biology encoded into
the space of natural proteins. Here we show that language models trained on tokens …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Generative verifiers: Reward modeling as next-token prediction

L Zhang, A Hosseini, H Bansal, M Kazemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Verifiers or reward models are often used to enhance the reasoning performance of large
language models (LLMs). A common approach is the Best-of-N method, where N candidate …
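
A minimal sketch of the Best-of-N selection the snippet refers to, assuming hypothetical `generate_candidate` and `score_with_verifier` callables standing in for a sampled LLM completion and a learned verifier/reward score; this is not the paper's implementation.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidate: Callable[[str], str],
              score_with_verifier: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates: List[str] = [generate_candidate(prompt) for _ in range(n)]
    scores = [score_with_verifier(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```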

Minicpm: Unveiling the potential of small language models with scalable training strategies

S Hu, Y Tu, X Han, C He, G Cui, X Long… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion
parameters has been met with concerns regarding resource efficiency and practical …

Massive activations in large language models

M Sun, X Chen, JZ Kolter, Z Liu - arXiv preprint arXiv:2402.17762, 2024 - arxiv.org
We observe an empirical phenomenon in Large Language Models (LLMs): very few
activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call …
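
A minimal sketch of how one might flag such "massive activations" in a hidden-state tensor: compare each entry's magnitude to the median magnitude and report the extreme ratio. The threshold and the use of the median as the reference are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def find_massive_activations(hidden: torch.Tensor, ratio: float = 1000.0):
    """Return indices of entries whose |value| exceeds `ratio` times the median |value|,
    along with the max-to-median magnitude ratio observed in the tensor."""
    abs_vals = hidden.abs()
    median = abs_vals.median()
    mask = abs_vals > ratio * median
    return mask.nonzero(as_tuple=False), (abs_vals.max() / (median + 1e-12)).item()
```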

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …
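
A minimal sketch of the workflow the snippet alludes to: fit a saturating power law to cheap small-scale runs, then extrapolate to a large compute budget. The data points, functional form, and initial guess below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_compute(c_rel, a, b, irreducible):
    """Saturating power law in (relative) compute."""
    return a * c_rel ** (-b) + irreducible

flops = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # compute of the small runs (made up)
losses = np.array([3.9, 3.6, 3.3, 3.1, 2.9])       # measured validation losses (made up)

c_rel = flops / flops[0]                            # normalize for a well-conditioned fit
params, _ = curve_fit(loss_vs_compute, c_rel, losses, p0=[2.0, 0.3, 2.0])

print("predicted loss at 1e21 FLOPs:",
      loss_vs_compute(1e21 / flops[0], *params))    # extrapolate beyond the fitted range
```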

Scaling exponents across parameterizations and optimizers

K Everett, L Xiao, M Wortsman, AA Alemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robust and effective scaling of models from small to large width typically requires the
precise adjustment of many algorithmic and architectural details, such as parameterization …

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
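
For context, a minimal sketch of a Hoffmann-et-al.-style compute-optimal allocation using the common approximations C ≈ 6·N·D and roughly 20 training tokens per parameter; these are the widely quoted rules of thumb, not the constants this paper derives.

```python
def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C into parameters N and tokens D with C ~= 6*N*D and D ~= 20*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

print(compute_optimal_allocation(1e21))  # roughly 2.9e9 parameters and 5.8e10 tokens
```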

Disentangling the causes of plasticity loss in neural networks

C Lyle, Z Zheng, K Khetarpal, H van Hasselt… - arXiv preprint arXiv …, 2024 - arxiv.org
Underpinning the past decades of work on the design, initialization, and optimization of
neural networks is a seemingly innocuous assumption: that the network is trained on a …

Deconstructing what makes a good optimizer for language models

R Zhao, D Morwani, D Brandfonbrener, N Vyas… - arXiv preprint arXiv …, 2024 - arxiv.org
Training language models becomes increasingly expensive with scale, prompting numerous
attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer …
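
A minimal sketch of a single Adam update step, included only to make the optimizer named in the snippet concrete; the hyperparameters are the usual defaults, not settings studied in the paper.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (param, m, v) after one Adam step at timestep t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad                  # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2             # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                        # bias corrections
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```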