A review of sparse expert models in deep learning

W Fedus, J Dean, B Zoph - arXiv preprint arXiv:2209.01667, 2022 - arxiv.org
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in
deep learning. This class of architecture encompasses Mixture-of-Experts, Switch …

Efficient acceleration of deep learning inference on resource-constrained edge devices: A review

MMH Shuvo, SK Islam, J Cheng… - Proceedings of the …, 2022 - ieeexplore.ieee.org
Successful integration of deep neural networks (DNNs) or deep learning (DL) has resulted
in breakthroughs in many areas. However, deploying these highly accurate models for data …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

GPT-NeoX-20B: An open-source autoregressive language model

S Black, S Biderman, E Hallahan, Q Anthony… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model
trained on the Pile, whose weights will be made freely and openly available to the public …

DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale

RY Aminabadi, S Rajbhandari, AA Awan… - … Conference for High …, 2022 - ieeexplore.ieee.org
The landscape of transformer model inference is increasingly diverse in model size, model
characteristics, latency and throughput requirements, hardware requirements, etc. With such …
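A hedged sketch of how a model might be served with this system, assuming the `deepspeed.init_inference` entry point and the `mp_size` / `dtype` / `replace_with_kernel_inject` arguments exposed in DeepSpeed releases around 2022; the checkpoint name and tensor-parallel degree are illustrative, not taken from the paper.

```python
# Hedged sketch: serving a Hugging Face causal LM with DeepSpeed-Inference.
# Assumes deepspeed.init_inference with mp_size / dtype / replace_with_kernel_inject
# (arguments as in the 2022-era releases); launch with: deepspeed --num_gpus 2 serve.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Partition the model across GPUs and inject fused transformer inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree (one rank per GPU)
    dtype=torch.float16,              # half-precision inference
    replace_with_kernel_inject=True,  # swap in DeepSpeed's optimized kernels
)

prompt = tokenizer("Sparse expert models", return_tensors="pt").to(torch.cuda.current_device())
with torch.no_grad():
    output = engine.module.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```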

Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity

W Fedus, B Zoph, N Shazeer - Journal of Machine Learning Research, 2022 - jmlr.org
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …
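As a concrete illustration of the per-token parameter selection the abstract describes, here is a minimal, self-contained sketch of a top-1 ("switch") routing layer in plain PyTorch; the module name and dimensions are hypothetical, and the load-balancing loss and expert-capacity constraints used in the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Simplified top-1 Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to exactly one expert.
        probs = F.softmax(self.router(x), dim=-1)       # (tokens, num_experts)
        gate, expert_idx = probs.max(dim=-1)            # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale each expert's output by its gate so the router gets gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                 # 8 tokens, d_model = 64
layer = SwitchFFN(d_model=64, d_ff=256, num_experts=4)
print(layer(tokens).shape)                  # torch.Size([8, 64])
```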

PyTorch FSDP: Experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
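A brief sketch of what wrapping a model in PyTorch's fully sharded data parallel looks like, assuming the public `torch.distributed.fsdp.FullyShardedDataParallel` wrapper and a standard `torchrun` launch; the toy model and hyperparameters are illustrative.

```python
# Minimal FSDP sketch; assumes launch via `torchrun --nproc_per_node=N train.py`
# so RANK / WORLD_SIZE / LOCAL_RANK are set in the environment.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    # Parameters, gradients, and optimizer state are sharded across ranks;
    # full parameters are gathered on demand for each forward/backward pass.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```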

GShard: Scaling giant models with conditional computation and automatic sharding

D Lepikhin, HJ Lee, Y Xu, D Chen, O Firat… - arXiv preprint arXiv …, 2020 - arxiv.org
Neural network scaling has been critical for improving the model quality in many real-world
machine learning applications with vast amounts of training data and compute. Although this …

ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning

S Rajbhandari, O Ruwase, J Rasley, S Smith… - Proceedings of the …, 2021 - dl.acm.org
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …
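A back-of-the-envelope calculation makes the memory wall in this abstract concrete, using the mixed-precision Adam accounting from the ZeRO line of work (roughly 16 bytes of model plus optimizer state per parameter); the 100B-parameter model size and 80 GB GPU here are illustrative assumptions, not figures from the paper.

```python
# Rough training-memory estimate per the ZeRO papers' mixed-precision Adam accounting:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights, momentum, variance (12 B).
BYTES_PER_PARAM = 2 + 2 + 12     # ~16 bytes per parameter, excluding activations

params = 100e9                   # illustrative 100B-parameter dense model
gpu_memory_gb = 80               # illustrative high-end GPU

state_gb = params * BYTES_PER_PARAM / 1e9
print(f"Model + optimizer state: {state_gb:,.0f} GB")                   # ~1,600 GB
print(f"GPUs needed just to hold state: {state_gb / gpu_memory_gb:.0f}") # ~20
```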

Megatron-LM: Training multi-billion parameter language models using model parallelism

M Shoeybi, M Patwary, R Puri, P LeGresley… - arXiv preprint arXiv …, 2019 - arxiv.org
Recent work in language modeling demonstrates that training large transformer models
advances the state of the art in Natural Language Processing applications. However, very …
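To make the intra-layer model parallelism concrete, here is a single-process sketch of the Megatron-style split of a transformer MLP: the first linear layer is partitioned column-wise and the second row-wise, so each shard's output only needs a final summation (the all-reduce across tensor-parallel ranks in the real distributed setting). The sizes are illustrative and no actual multi-GPU communication is shown.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff, n_parts = 64, 256, 2
x = torch.randn(8, d_model)                  # 8 tokens

# Unpartitioned reference MLP: y = GeLU(x A) B
A = torch.randn(d_model, d_ff)
B = torch.randn(d_ff, d_model)
reference = F.gelu(x @ A) @ B

# Megatron-style split: A by columns, B by rows; each "device" holds one shard pair.
A_shards = A.chunk(n_parts, dim=1)           # column-parallel first GEMM
B_shards = B.chunk(n_parts, dim=0)           # row-parallel second GEMM
partials = [F.gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# Summing the partial outputs stands in for the all-reduce across tensor-parallel ranks.
parallel = sum(partials)
print(torch.allclose(reference, parallel, atol=1e-4))   # True
```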