Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

C Chen, X Li, Q Zhu, J Duan, P Sun, X Zhang… - Proceedings of the 29th …, 2024 - dl.acm.org
Efficiently training large language models (LLMs) necessitates the adoption of hybrid
parallel methods, integrating multiple communication collectives within distributed …

vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training

J Bang, Y Choi, M Kim, Y Kim, M Rhu - arXiv preprint arXiv:2312.12391, 2023 - arxiv.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge facing the AI community is how to train these large AI models in a cost …

InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding

Q Chen, D Gu, G Wang, X Chen, YT Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) with long sequences are beginning to power more and more
of the fundamentally new applications we use every day. Existing methods for long-sequence LLM …

Optimizing distributed training on Frontier for large language models

S Dash, IR Lyngaas, J Yin, X Wang… - ISC High …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have demonstrated remarkable success as foundational
models, benefiting various downstream applications through fine-tuning. Loss scaling …

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

J Hagemann, S Weinbach, K Dobler, M Schall… - arXiv preprint arXiv …, 2023 - arxiv.org
Efficiently training large language models requires parallelizing across hundreds of
hardware accelerators and invoking various compute and memory optimizations. When …

Enhancing Stability for Large Models Training in Constrained Bandwidth Networks

Y Dai, T Dharamsi, B Hsu, T Song, H Firooz - arXiv preprint arXiv …, 2024 - arxiv.org
Training extremely large language models with billions of parameters is a computationally
intensive task that pushes the limits of current data parallel training systems. While …

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

S Ao, W Zhao, X Han, C Yang, Z Liu, C Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of large language models (LLMs) relies heavily on distributed training
strategies, among which pipeline parallelism plays a crucial role. As LLMs' training …

CO2: Efficient distributed training with full communication-computation overlap

W Sun, Z Qin, W Sun, S Li, D Li, X Shen, Y Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
The fundamental success of large language models hinges on the effective
implementation of large-scale distributed training techniques. Nevertheless, building a vast …

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

J Dong, B Luo, J Zhang, P Zhang, F Feng, Y Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel
training techniques, involving the deployment of thousands of GPUs to train a single model …

Optimus-CC: Efficient large NLP model training with 3D parallelism aware communication compression

J Song, J Yim, J Jung, H Jang, HJ Kim, Y Kim… - Proceedings of the 28th …, 2023 - dl.acm.org
When training modern large natural language processing (NLP) models, it has become
common practice to split models across multiple GPUs using 3D parallelism. Such a technique …