Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …

Transom: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) represented by ChatGPT have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …

Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

Y Wang, S Shi, X He, Z Tang, X Pan, Y Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Extensive system scales (i.e., thousands of GPUs/TPUs) and prolonged training periods (i.e.,
months of pretraining) significantly escalate the probability of failures when training large …

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

GaLore: Memory-efficient LLM training by gradient low-rank projection

J Zhao, Z Zhang, B Chen, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Training Large Language Models (LLMs) presents significant memory challenges,
predominantly due to the growing size of weights and optimizer states. Common memory …

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

H Fan, H Zhou, G Huang, P Raman, X Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
Getting large language models (LLMs) to perform well on downstream tasks requires pre-
training over trillions of tokens. This typically demands a large number of powerful …

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

PID Control-Based Self-Healing to Improve the Robustness of Large Language Models

Z Chen, Z Wang, Y Yang, Q Li, Z Zhang - arXiv preprint arXiv:2404.00828, 2024 - arxiv.org
Despite the effectiveness of deep neural networks in numerous natural language processing
applications, recent findings have exposed the vulnerability of these language models when …

Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

X Du, T Gunter, X Kong, M Lee, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while
keeping computation cost constant. When comparing MoE to dense models, prior work …

Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal

J Huang, L Cui, A Wang, C Yang, X Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) suffer from catastrophic forgetting during continual learning.
Conventional rehearsal-based methods rely on previous training data to retain the model's …