Authors
Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang
Publication date
2021/11/14
Book
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Pages
1-15
Description
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion …
Scholar articles
Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International Conference for High …, 2021