Multi-task learning in natural language processing: An overview

S Chen, Y Zhang, Q Yang - ACM Computing Surveys, 2021 - dl.acm.org
Deep learning approaches have achieved great success in the field of Natural Language
Processing (NLP). However, directly training deep neural models often suffers from overfitting …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

How good are GPT models at machine translation? A comprehensive evaluation

A Hendy, M Abdelrehim, A Sharaf, V Raunak… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for
natural language generation, but their performance for machine translation has not been …

DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale

RY Aminabadi, S Rajbhandari, AA Awan… - … Conference for High …, 2022 - ieeexplore.ieee.org
The landscape of transformer model inference is increasingly diverse in model size, model
characteristics, latency and throughput requirements, hardware requirements, etc. With such …

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

S Rajbhandari, C Li, Z Yao, M Zhang… - International …, 2022 - proceedings.mlr.press
As the training of giant dense models hits the limits of today's hardware availability and
capability, Mixture-of-Experts (MoE) models have become one of the …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

To repeat or not to repeat: Insights from scaling LLM under token-crisis

F Xue, Y Fu, W Zhou, Z Zheng… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent research has highlighted the importance of dataset size in scaling language models.
However, large language models (LLMs) are notoriously token-hungry during pre-training …

The efficiency misnomer

M Dehghani, A Arnab, L Beyer, A Vaswani… - arXiv preprint arXiv …, 2021 - arxiv.org
Model efficiency is a critical aspect of developing and deploying machine learning models.
Inference time and latency directly affect the user experience, and some applications have …

Uni-Perceiver-MoE: Learning sparse generalist models with conditional MoEs

J Zhu, X Zhu, W Wang, X Wang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
To build artificial neural networks that resemble the biological intelligence system, recent works have
unified numerous tasks into a generalist model, which can process various tasks with shared …

Alexa teacher model: Pretraining and distilling multi-billion-parameter encoders for natural language understanding systems

J FitzGerald, S Ananthakrishnan, K Arkoudas… - Proceedings of the 28th …, 2022 - dl.acm.org
We present results from a large-scale experiment on pretraining encoders with non-
embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into …