Authors
Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, Tianwei Zhang
Publication date
2021/11/1
Book
Proceedings of the ACM Symposium on Cloud Computing
Pages
609-623
Description
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is key to improving training performance, resource utilization, and fairness across users. Different training jobs may have different objectives and demands in terms of completion time. How to efficiently satisfy all of these requirements has not been extensively studied.
We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs. Chronus is designed around the unique features of DLT jobs. (1) It leverages the intra-job predictability of DLT processes to efficiently profile jobs and estimate their runtime speed with dynamic resource scaling. (2) It takes advantage of the DLT preemption feature to select jobs with a lease-based training scheme. (3) It considers the placement sensitivity of DLT jobs to allocate resources with new …
Scholar articles
W Gao, Z Ye, P Sun, Y Wen, T Zhang - Proceedings of the ACM Symposium on Cloud …, 2021