The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …
GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process …
Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence …
Large tech companies are piling up a massive number of GPUs in their server fleets to run diverse machine learning (ML) workloads. However, these expensive devices often suffer …
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is the key to improve the training performance, resource utilization and …
B Li, R Arora, S Samsi, T Patel… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Production high-performance computing (HPC) systems are adopting and integrating GPUs into their design to accommodate artificial intelligence (AI), machine learning, and data …
Y Li, X Zhang, T Zeng, J Duan, C Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …
Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous to deploy them in practice due to some substantial defects, including inflexible intrusive …
The rapid growth in demand for HPC systems has led to a rise in carbon footprint, which requires urgent intervention. In this work, we present a comprehensive analysis of the …