Author
Zhuang Wang
Publication Date
2023/12
Institution
Rice University
Abstract
The record-breaking performance of deep neural networks (DNNs) has brought remarkable success to many domains, such as computer vision, natural language processing, and recommendation systems. However, these powerful DNNs come with significant computational cost driven by growing training data and model sizes. Because network bandwidth in GPU clouds has not kept pace with improvements in GPU compute capacity or with the rapid growth of datasets and models, deep learning practitioners have struggled to scale up DNN training efficiently. This thesis identifies and addresses research challenges in scaling distributed deep learning (DDL) by optimizing communication in both the data plane and the management plane.