Multi-task learning in natural language processing: An overview

S Chen, Y Zhang, Q Yang - ACM Computing Surveys, 2021 - dl.acm.org
Deep learning approaches have achieved great success in the field of Natural Language
Processing (NLP). However, directly training deep neural models often suffers from overfitting …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

How good are GPT models at machine translation? A comprehensive evaluation

A Hendy, M Abdelrehim, A Sharaf, V Raunak… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for
natural language generation, but their performance for machine translation has not been …

DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale

RY Aminabadi, S Rajbhandari, AA Awan… - … Conference for High …, 2022 - ieeexplore.ieee.org
The landscape of transformer model inference is increasingly diverse in model size, model
characteristics, latency and throughput requirements, hardware requirements, etc. With such …

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

S Rajbhandari, C Li, Z Yao, M Zhang… - International …, 2022 - proceedings.mlr.press
As the training of giant dense models hits the limits of today's hardware availability and
capability, Mixture-of-Experts (MoE) models have become one of the …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

To repeat or not to repeat: Insights from scaling LLM under token-crisis

F Xue, Y Fu, W Zhou, Z Zheng… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent research has highlighted the importance of dataset size in scaling language models.
However, large language models (LLMs) are notoriously token-hungry during pre-training …

The efficiency misnomer

M Dehghani, A Arnab, L Beyer, A Vaswani… - arXiv preprint arXiv …, 2021 - arxiv.org
Model efficiency is a critical aspect of developing and deploying machine learning models.
Inference time and latency directly affect the user experience, and some applications have …

Uni-Perceiver-MoE: Learning sparse generalist models with conditional MoEs

J Zhu, X Zhu, W Wang, X Wang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
To build artificial neural networks that resemble the biological intelligence system, recent works have
unified numerous tasks into a generalist model, which can process various tasks with shared …

Alexa teacher model: Pretraining and distilling multi-billion-parameter encoders for natural language understanding systems

J FitzGerald, S Ananthakrishnan, K Arkoudas… - Proceedings of the 28th …, 2022 - dl.acm.org
We present results from a large-scale experiment on pretraining encoders with non-
embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into …