Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Xu, Z Liu, B Chen, S Zhong, Y Tang, W Jue… - Forty-first International … - openreview.net

Model compression is one of the most popular approaches to improve the accessibility of
Large Language Models (LLMs) by reducing their memory footprint. However, the gaining of …

Building efficient and practical machine learning systems

Q Hu - 2023 - dr.ntu.edu.sg

With the widespread adoption of deep learning (DL) applications in recent years, training DL
models has become increasingly prevalent. Nevertheless, training these models is typically …

[PDF] A survey on large model training technologies

HD Tian, MZ Zhang, R Chang - ZTE Technology Journal, 2024 - zte.com.cn

Achieving efficient training has become one of the key factors affecting the popularization of
large model applications. The main technologies of efficient training of large models are …

PancakeFS: A Write Efficiently and Read Optimized Filesystem

Y Wang, Y Liu, Z Chen - 2024 4th International Conference on …, 2024 - ieeexplore.ieee.org

With the development of emerging technologies such as cloud computing and large AI
models (such as LLM), many applications have placed higher demands on the intensive …

[PDF] LubeRDMA: A Fail-safe Mechanism of RDMA

S Lin, Q Yang, Z Yang, Y Wang, S Zhao - 2024 - jhc.sjtu.edu.cn

Recent years have witnessed a wide adoption of Remote Direct Memory Access (RDMA) to
accelerate distributed systems. As the scale of distributed applications keeps increasing …

[PDF] TS Eugene Ng

Z Wang - 2023 - repository.rice.edu

Deep neural networks (DNNs) have achieved unparalleled performance in numerous fields,
including computer vision, natural language processing, and recommendation systems …

[PDF] Optimizing Data I/O for LLM Datasets on Remote Storage

T Zhong, J Zhao, X Guo, Q Su, G Fox - luosuu.github.io

Training large language models (LLMs) demands increasingly larger datasets for optimal
performance [13]. In practice, these datasets may include hundreds of terabytes (TB) or even …
