GLM-130B: An open bilingual pre-trained model

A Zeng, X Liu, Z Du, Z Wang, H Lai, M Ding… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model
with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as …

PaLM: Scaling language modeling with Pathways

A Chowdhery, S Narang, J Devlin, M Bosma… - Journal of Machine …, 2023 - jmlr.org
Large language models have been shown to achieve remarkable performance across a
variety of natural language tasks using few-shot learning, which drastically reduces the …
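A minimal illustration of the few-shot prompting referred to in this snippet, where the task is specified with a handful of in-context examples in the prompt rather than by fine-tuning; the examples and labels below are made up for illustration:

```python
# Sketch of few-shot prompting: the task is demonstrated with in-context
# examples instead of gradient updates. Examples are illustrative only.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this string would be sent to the language model for completion
```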

Enabling resource-efficient AIoT system with cross-level optimization: A survey

S Liu, B Guo, C Fang, Z Wang, S Luo… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The emerging field of artificial intelligence of things (AIoT, AI + IoT) is driven by the
widespread use of intelligent infrastructures and the impressive success of deep learning …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …
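The snippet is truncated, but the core Megatron-LM idea of intra-layer (tensor) model parallelism can be shown with a minimal single-process sketch, assuming PyTorch; here a Python list of weight shards stands in for separate GPU ranks, and the concatenation stands in for the collective communication that a real implementation performs:

```python
import torch

# Single-process sketch of tensor (intra-layer) model parallelism for a linear
# layer: the weight matrix is split column-wise across "ranks", each rank
# computes its slice of the output, and the slices are gathered at the end.
torch.manual_seed(0)
d_in, d_out, world_size = 256, 512, 4

x = torch.randn(8, d_in)                      # activations, replicated on every rank
w = torch.randn(d_in, d_out)                  # full weight, kept only for the reference check
w_shards = torch.chunk(w, world_size, dim=1)  # each rank owns d_out / world_size columns

partial_outputs = [x @ w_i for w_i in w_shards]   # local matmuls, one per rank
y_parallel = torch.cat(partial_outputs, dim=1)    # stands in for the all-gather

assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```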

Reducing activation recomputation in large transformer models

VA Korthikanti, J Casper, S Lym… - Proceedings of …, 2023 - proceedings.mlsys.org
Training large transformer models is one of the most important computational challenges of
modern AI. In this paper, we show how to significantly accelerate the training of large …
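For context, the baseline this paper improves on is full activation recomputation (gradient checkpointing), in which a layer's activations are discarded after the forward pass and recomputed during backward to save memory. A minimal PyTorch sketch of that baseline, with a toy residual block whose names and sizes are purely illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        return x + self.ff(x)

class Model(torch.nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for blk in self.blocks:
            # Drop this block's activations after the forward pass and
            # recompute them during backward instead of storing them.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(8, 128, 256, requires_grad=True)
model = Model(d=256, n_layers=4)
model(x).sum().backward()
```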

ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning

S Rajbhandari, O Ruwase, J Rasley, S Smith… - Proceedings of the …, 2021 - dl.acm.org
In the last three years, the largest dense deep learning models have grown over 1000x to
reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 …
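A minimal sketch of the offloading idea behind this line of work: keep the fp32 master weights and optimizer states off the accelerator and only a working copy of the parameter on it. This assumes PyTorch and a plain Adam update; it illustrates the general ZeRO-offload pattern rather than the library's actual API, and the shapes and hyperparameters are made up:

```python
import torch

# Sketch of optimizer-state offloading: the working parameter lives on the
# accelerator, while the fp32 master copy and Adam moments stay in CPU memory.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

param = torch.nn.Parameter(torch.randn(1024, 1024, device=device, dtype=dtype))
master = param.detach().float().cpu()        # fp32 master weights, offloaded
exp_avg = torch.zeros_like(master)           # Adam first moment, offloaded
exp_avg_sq = torch.zeros_like(master)        # Adam second moment, offloaded
lr, b1, b2, eps, step = 1e-4, 0.9, 0.95, 1e-8, 0

def offloaded_adam_step(grad):
    """Run the Adam update on CPU, then refresh the device copy of the weights."""
    global step
    step += 1
    g = grad.detach().float().cpu()                       # pull the gradient to CPU
    exp_avg.mul_(b1).add_(g, alpha=1 - b1)
    exp_avg_sq.mul_(b2).addcmul_(g, g, value=1 - b2)
    update = (exp_avg / (1 - b1 ** step)) / ((exp_avg_sq / (1 - b2 ** step)).sqrt() + eps)
    master.add_(update, alpha=-lr)
    param.data.copy_(master.to(device=device, dtype=dtype))  # push updated weights back

loss = (param.float() ** 2).sum()            # toy objective
loss.backward()
offloaded_adam_step(param.grad)
```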

Decentralized training of foundation models in heterogeneous environments

B Yuan, Y He, J Davis, T Zhang… - Advances in …, 2022 - proceedings.neurips.cc
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …

Ring attention with blockwise transformers for near-infinite context

H Liu, M Zaharia, P Abbeel - arXiv preprint arXiv:2310.01889, 2023 - arxiv.org
Transformers have emerged as the architecture of choice for many state-of-the-art AI
models, showcasing exceptional performance across a wide range of AI applications …
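A minimal NumPy sketch of the blockwise-attention core with an online softmax, which is what allows keys and values to be processed one block at a time without ever materializing the full attention matrix; the actual Ring Attention additionally rotates key/value blocks around a ring of devices to overlap communication with compute, which is not shown here:

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=64):
    """Single-device sketch of blockwise attention with an online softmax."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)        # running max of the logits
    denom = np.zeros((q.shape[0], 1))            # running softmax denominator
    acc = np.zeros((q.shape[0], v.shape[-1]))    # running weighted sum of values

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                   # logits for this key/value block
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)           # rescale previously accumulated results
        p = np.exp(s - m_new)
        denom = denom * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / denom

# The blockwise result matches ordinary full attention on a toy example.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(32, 16)), rng.normal(size=(128, 16)), rng.normal(size=(128, 16))
logits = (q @ k.T) / np.sqrt(16)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
reference = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), reference)
```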

Skeleton-of-thought: Large language models can do parallel decoding

X Ning, Z Lin, Z Zhou, H Yang, Y Wang - arXiv preprint arXiv:2307.15337, 2023 - arxiv.org
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …
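A minimal sketch of the two-stage idea described in this snippet, assuming Python and a placeholder llm() call (hypothetical, to be replaced with any real LLM API): first request a short skeleton of the answer, then expand the points concurrently so the expansions are not decoded strictly one after another:

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API; hypothetical, not from the paper."""
    raise NotImplementedError("replace with a real LLM call")

def skeleton_of_thought(question: str, max_points: int = 5) -> str:
    # Stage 1: ask for a short skeleton (a few bullet points) instead of a full answer.
    skeleton = llm(
        f"Give a skeleton of the answer to the question below as at most "
        f"{max_points} short bullet points, one per line.\nQuestion: {question}"
    )
    points = [p.strip("-• ").strip() for p in skeleton.splitlines() if p.strip()]

    # Stage 2: expand the points in parallel; each expansion is an independent
    # request, so decoding is no longer sequential across the whole answer.
    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this point in 2-3 sentences: {point}")

    with ThreadPoolExecutor(max_workers=max(len(points), 1)) as pool:
        expansions = list(pool.map(expand, points))

    return "\n\n".join(expansions)
```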