W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and …
Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …
Y Bao, Y Peng, C Wu - IEEE INFOCOM 2019-IEEE conference …, 2019 - ieeexplore.ieee.org
Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition, machine translation. While server sharing among jobs …
Y Bao, Y Peng, C Wu, Z Li - IEEE INFOCOM 2018-IEEE …, 2018 - ieeexplore.ieee.org
Nowadays, large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the …
Efficient resource scheduling is essential for maximal utilization of expensive deep learning (DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …
In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …
Y Bao, Y Peng, C Wu - IEEE/ACM Transactions on Networking, 2022 - ieeexplore.ieee.org
Nowadays, most leading IT companies host a variety of distributed machine learning (ML) workloads in ML clusters to support AI-driven services, such as speech recognition, machine …
M Yu, Y Tian, B Ji, C Wu, H Rajan… - IEEE INFOCOM 2022 …, 2022 - ieeexplore.ieee.org
Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process …