W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished across a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …
A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024 - yangwang83.github.io
In public clouds, users must manually select a datacenter region, upload their ML training data to it, and launch their ML training workloads in that same region to ensure data and computation …
S Vannuccini, E Prytkova - Journal of Information Technology, 2024 - journals.sagepub.com
In this paper, we offer an original framework to study Artificial Intelligence (AI). The perspective we propose is based on the idea that AI is a system technology, and that a …
Large tech companies are amassing massive numbers of GPUs in their server fleets to run diverse machine learning (ML) workloads. However, these expensive devices often suffer …
J Li, H Xu, Y Zhu, Z Liu, C Guo, C Wang - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Organizations often build separate training and inference clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both: inference clusters …
While recent deep learning workload schedulers exhibit excellent performance, they are difficult to deploy in practice owing to substantial defects, including their inflexible, intrusive …
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., a combination of data, model, and pipeline parallelism, to exploit large GPU clusters efficiently. Long-running jobs may …
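As an aside (not taken from any of the papers listed here), the snippet above mentions combining data, model, and pipeline parallelism; a minimal, hypothetical sketch of how those three degrees compose into a job's total GPU demand could look as follows, with all class and field names invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    """Hypothetical description of one DL job's multi-dimensional parallelism."""
    data_parallel: int      # number of full model replicas
    model_parallel: int     # GPUs a single layer/operator is sharded across
    pipeline_parallel: int  # number of pipeline stages per replica

    def total_gpus(self) -> int:
        # The three degrees multiply: each data-parallel replica is split into
        # pipeline stages, and each stage is further sharded across GPUs.
        return self.data_parallel * self.model_parallel * self.pipeline_parallel


if __name__ == "__main__":
    job = ParallelismConfig(data_parallel=4, model_parallel=8, pipeline_parallel=2)
    print(job.total_gpus())  # 4 * 8 * 2 = 64 GPUs requested from the cluster
```

This only illustrates the multiplicative sizing implied by multi-dimensional parallelism; the cited papers address how such jobs are scheduled and reconfigured, which the sketch does not cover.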
Deep learning training on cloud platforms typically follows the convention of separating storage from compute. Training executes on a compute cluster equipped with …