Authors
Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang
Publication date
2022/11/7
Book
Proceedings of the 13th Symposium on Cloud Computing
Pages
348-354
Description
Recent breakthroughs in foundation model (FM) research have created a new trend: obtaining efficient DL models by fine-tuning FMs on low-resource datasets. Current GPU clusters, however, are built mainly for developing DL models by training from scratch. How to tailor a GPU cluster scheduler to FM fine-tuning workloads remains unexplored.
We propose Titan, a scheduler that improves the efficiency of FM fine-tuning workloads by exploiting three of their distinctive features. (1) It takes full advantage of the fixed model structure to estimate job duration accurately and configure the fine-tuning workload efficiently. (2) The multi-task adaptivity of FMs enables multiple fine-tuning workloads to share the same model parameters, which can significantly reduce GPU resource consumption. (3) It considers the pipeline parallelism of FM fine-tuning workloads and concurrently executes the parameter transmission and gradient …
Scholar articles
W Gao, P Sun, Y Wen, T Zhang - Proceedings of the 13th Symposium on Cloud …, 2022