Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …

Transom: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) represented by ChatGPT have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …

Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

Y Wang, S Shi, X He, Z Tang, X Pan, Y Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Extensive system scales (i.e., thousands of GPUs/TPUs) and prolonged training periods (i.e.,
months of pretraining) significantly escalate the probability of failures when training large …

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

GaLore: Memory-efficient LLM training by gradient low-rank projection

J Zhao, Z Zhang, B Chen, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Training Large Language Models (LLMs) presents significant memory challenges,
predominantly due to the growing size of weights and optimizer states. Common memory …

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

H Fan, H Zhou, G Huang, P Raman, X Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
Getting large language models (LLMs) to perform well on downstream tasks requires pre-
training over trillions of tokens. This typically demands a large number of powerful …

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

PID Control-Based Self-Healing to Improve the Robustness of Large Language Models

Z Chen, Z Wang, Y Yang, Q Li, Z Zhang - arXiv preprint arXiv:2404.00828, 2024 - arxiv.org
Despite the effectiveness of deep neural networks in numerous natural language processing
applications, recent findings have exposed the vulnerability of these language models when …

Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

X Du, T Gunter, X Kong, M Lee, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while
keeping computation cost constant. When comparing MoE to dense models, prior work …

Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal

J Huang, L Cui, A Wang, C Yang, X Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) suffer from catastrophic forgetting during continual learning.
Conventional rehearsal-based methods rely on previous training data to retain the model's …