GLM-130B: An open bilingual pre-trained model

A Zeng, X Liu, Z Du, Z Wang, H Lai, M Ding… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model
with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as …

PaLM: Scaling language modeling with Pathways

A Chowdhery, S Narang, J Devlin, M Bosma… - Journal of Machine …, 2023 - jmlr.org
Large language models have been shown to achieve remarkable performance across a
variety of natural language tasks using few-shot learning, which drastically reduces the …
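A minimal illustration of the few-shot prompting referred to in this snippet, where the task is specified with a handful of in-context examples in the prompt rather than by fine-tuning; the examples and labels below are made up for illustration:

```python
# Sketch of few-shot prompting: the task is demonstrated with in-context
# examples instead of gradient updates. Examples are illustrative only.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this string would be sent to the language model for completion
```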

Enabling resource-efficient AIoT system with cross-level optimization: A survey

S Liu, B Guo, C Fang, Z Wang, S Luo… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The emerging field of artificial intelligence of things (AIoT, AI + IoT) is driven by the
widespread use of intelligent infrastructures and the impressive success of deep learning …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …
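The snippet is truncated, but the core Megatron-LM idea of intra-layer (tensor) model parallelism can be shown with a minimal single-process sketch, assuming PyTorch; here a Python list of weight shards stands in for separate GPU ranks, and the concatenation stands in for the collective communication that a real implementation performs:

```python
import torch

# Single-process sketch of tensor (intra-layer) model parallelism for a linear
# layer: the weight matrix is split column-wise across "ranks", each rank
# computes its slice of the output, and the slices are gathered at the end.
torch.manual_seed(0)
d_in, d_out, world_size = 256, 512, 4

x = torch.randn(8, d_in)                      # activations, replicated on every rank
w = torch.randn(d_in, d_out)                  # full weight, kept only for the reference check
w_shards = torch.chunk(w, world_size, dim=1)  # each rank owns d_out / world_size columns

partial_outputs = [x @ w_i for w_i in w_shards]   # local matmuls, one per rank
y_parallel = torch.cat(partial_outputs, dim=1)    # stands in for the all-gather

assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```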

Reducing activation recomputation in large transformer models

VA Korthikanti, J Casper, S Lym… - Proceedings of …, 2023 - proceedings.mlsys.org
Training large transformer models is one of the most important computational challenges of
modern AI. In this paper, we show how to significantly accelerate the training of large …
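For context, the baseline this paper improves on is full activation recomputation (gradient checkpointing), in which a layer's activations are discarded after the forward pass and recomputed during backward to save memory. A minimal PyTorch sketch of that baseline, with a toy residual block whose names and sizes are purely illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        return x + self.ff(x)

class Model(torch.nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for blk in self.blocks:
            # Drop this block's activations after the forward pass and
            # recompute them during backward instead of storing them.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(8, 128, 256, requires_grad=True)
model = Model(d=256, n_layers=4)
model(x).sum().backward()
```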

ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning

S Rajbhandari, O Ruwase, J Rasley, S Smith… - Proceedings of the …, 2021 - dl.acm.org
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …
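A minimal sketch of the offloading idea behind this line of work: keep the fp32 master weights and optimizer states off the accelerator and only a working copy of the parameter on it. This assumes PyTorch and a plain Adam update; it illustrates the general ZeRO-offload pattern rather than the library's actual API, and the shapes and hyperparameters are made up:

```python
import torch

# Sketch of optimizer-state offloading: the working parameter lives on the
# accelerator, while the fp32 master copy and Adam moments stay in CPU memory.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

param = torch.nn.Parameter(torch.randn(1024, 1024, device=device, dtype=dtype))
master = param.detach().float().cpu()        # fp32 master weights, offloaded
exp_avg = torch.zeros_like(master)           # Adam first moment, offloaded
exp_avg_sq = torch.zeros_like(master)        # Adam second moment, offloaded
lr, b1, b2, eps, step = 1e-4, 0.9, 0.95, 1e-8, 0

def offloaded_adam_step(grad):
    """Run the Adam update on CPU, then refresh the device copy of the weights."""
    global step
    step += 1
    g = grad.detach().float().cpu()                       # pull the gradient to CPU
    exp_avg.mul_(b1).add_(g, alpha=1 - b1)
    exp_avg_sq.mul_(b2).addcmul_(g, g, value=1 - b2)
    update = (exp_avg / (1 - b1 ** step)) / ((exp_avg_sq / (1 - b2 ** step)).sqrt() + eps)
    master.add_(update, alpha=-lr)
    param.data.copy_(master.to(device=device, dtype=dtype))  # push updated weights back

loss = (param.float() ** 2).sum()            # toy objective
loss.backward()
offloaded_adam_step(param.grad)
```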

Decentralized training of foundation models in heterogeneous environments

B Yuan, Y He, J Davis, T Zhang… - Advances in …, 2022 - proceedings.neurips.cc
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …

Ring attention with blockwise transformers for near-infinite context

H Liu, M Zaharia, P Abbeel - arXiv preprint arXiv:2310.01889, 2023 - arxiv.org
Transformers have emerged as the architecture of choice for many state-of-the-art AI
models, showcasing exceptional performance across a wide range of AI applications …
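A minimal NumPy sketch of the blockwise-attention core with an online softmax, which is what allows keys and values to be processed one block at a time without ever materializing the full attention matrix; the actual Ring Attention additionally rotates key/value blocks around a ring of devices to overlap communication with compute, which is not shown here:

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=64):
    """Single-device sketch of blockwise attention with an online softmax."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)        # running max of the logits
    denom = np.zeros((q.shape[0], 1))            # running softmax denominator
    acc = np.zeros((q.shape[0], v.shape[-1]))    # running weighted sum of values

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                   # logits for this key/value block
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)           # rescale previously accumulated results
        p = np.exp(s - m_new)
        denom = denom * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / denom

# The blockwise result matches ordinary full attention on a toy example.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(32, 16)), rng.normal(size=(128, 16)), rng.normal(size=(128, 16))
logits = (q @ k.T) / np.sqrt(16)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
reference = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), reference)
```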

Skeleton-of-thought: Large language models can do parallel decoding

X Ning, Z Lin, Z Zhou, H Yang, Y Wang - arXiv preprint arXiv:2307.15337, 2023 - arxiv.org
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …
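A minimal sketch of the two-stage idea described in this snippet, assuming Python and a placeholder llm() call (hypothetical, to be replaced with any real LLM API): first request a short skeleton of the answer, then expand the points concurrently so the expansions are not decoded strictly one after another:

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API; hypothetical, not from the paper."""
    raise NotImplementedError("replace with a real LLM call")

def skeleton_of_thought(question: str, max_points: int = 5) -> str:
    # Stage 1: ask for a short skeleton (a few bullet points) instead of a full answer.
    skeleton = llm(
        f"Give a skeleton of the answer to the question below as at most "
        f"{max_points} short bullet points, one per line.\nQuestion: {question}"
    )
    points = [p.strip("-• ").strip() for p in skeleton.splitlines() if p.strip()]

    # Stage 2: expand the points in parallel; each expansion is an independent
    # request, so decoding is no longer sequential across the whole answer.
    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this point in 2-3 sentences: {point}")

    with ThreadPoolExecutor(max_workers=max(len(points), 1)) as pool:
        expansions = list(pool.map(expand, points))

    return "\n\n".join(expansions)
```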