F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload …
W Li, G Yuan, Z Wang, G Tan, P Zhang… - Journal of Optical …, 2024 - opg.optica.org
With the ever-increasing size of training models and datasets, network communication has emerged as a major bottleneck in distributed deep learning training. To address this …
Y Liu, J Zhang, S Liu, Q Wang, W Dai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, only about half of the maximum …
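For context on the algorithm this entry refers to, below is a minimal single-process sketch of ring all-reduce (the scatter-reduce phase followed by the all-gather phase). It is an illustrative NumPy simulation under assumed worker counts and chunk sizes, not the cited paper's implementation or a real network transport.

```python
# Simulate ring all-reduce for P "workers", each holding an equally sized
# gradient vector. Real systems (e.g. NCCL, Horovod) run the same schedule
# over network links; here everything happens in one process.
import numpy as np

def ring_allreduce(grads):
    """Return, for every worker, the elementwise sum of all workers' gradients."""
    p = len(grads)
    # Each worker splits its gradient into p chunks.
    chunks = [np.array_split(g.astype(float), p) for g in grads]

    # Phase 1: scatter-reduce. After p-1 steps, worker w holds the fully
    # reduced chunk (w + 1) % p.
    for step in range(p - 1):
        # Snapshot outgoing chunks so in-step updates model simultaneous sends.
        sends = [chunks[w][(w - step) % p].copy() for w in range(p)]
        for w in range(p):
            dst, idx = (w + 1) % p, (w - step) % p
            chunks[dst][idx] = chunks[dst][idx] + sends[w]

    # Phase 2: all-gather. Each fully reduced chunk circulates around the ring
    # until every worker holds every reduced chunk.
    for step in range(p - 1):
        sends = [chunks[w][(w + 1 - step) % p].copy() for w in range(p)]
        for w in range(p):
            dst, idx = (w + 1) % p, (w + 1 - step) % p
            chunks[dst][idx] = sends[w]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]
    reduced = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in reduced)
    print("all 4 workers hold the summed gradient")
```

Each worker sends and receives only 2(p-1)/p of the data per worker, which is why the ring schedule is bandwidth-optimal in the ideal case; the snippet above abstracts away the link-level effects the cited work analyzes.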
The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the computation and memory requirements of DNN …
Over the last few decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, the …
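As a reminder of the method this entry discusses, here is a minimal sketch of mini-batch SGD on a least-squares objective. The synthetic data, batch size, and learning rate are illustrative assumptions, not values from the cited work.

```python
# Mini-batch SGD on 0.5 * ||Xw - y||^2, updating w <- w - lr * grad each batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)
lr, batch = 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        # Gradient of the mean squared error on this mini-batch.
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad  # stochastic gradient step

print(np.round(w, 2))  # should be close to w_true
```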
Artificial intelligence (AI) research and the AI market have grown rapidly in the last few years, and this trend is expected to continue with many potential advancements and innovations in this …
This book provides a thorough explanation of the path to using cloud computing technologies to run High-Performance Computing (HPC) applications. Besides presenting …
Z Zhang, C Wang - IEEE Transactions on Parallel and …, 2021 - ieeexplore.ieee.org
Parameter server architecture has been identified as an efficient framework for scaling DNN training on clusters. For large-scale deployment, communication becomes the …
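To illustrate the architecture this entry refers to, here is a minimal single-process sketch of the parameter-server pattern: workers pull the current weights, compute gradients on their data shard, and push them back for the server to apply. The class and method names (ParameterServer, pull, push) and the least-squares example are illustrative assumptions, not an actual framework API.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current global weights.
        return self.w.copy()

    def push(self, grad):
        # Synchronous update; real deployments often shard keys across several
        # server nodes and may apply updates asynchronously.
        self.w -= self.lr * grad

def worker_grad(w, X, y):
    # Least-squares gradient on this worker's data shard.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 3))
y = X @ np.array([2.0, -1.0, 0.5])
shards = np.array_split(np.arange(400), 4)  # 4 workers

ps = ParameterServer(dim=3)
for step in range(200):
    grads = [worker_grad(ps.pull(), X[s], y[s]) for s in shards]
    ps.push(np.mean(grads, axis=0))  # aggregate worker gradients and apply

print(np.round(ps.w, 2))  # converges toward [2.0, -1.0, 0.5]
```

In a real cluster the pull/push calls become network RPCs, which is exactly where the communication cost discussed in this entry arises.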
L Remis, CW Lacewell - Proceedings of the VLDB Endowment, 2021 - dl.acm.org
Data scientists spend most of their time dealing with data preparation, rather than doing what they know best: building machine learning models and algorithms to solve previously …