Multi-Switch Cooperative In-Network Aggregation for Distributed Deep Learning

MW Su, YY Li, KCJ Lin - GLOBECOM 2023-2023 IEEE Global …, 2023 - ieeexplore.ieee.org
Distributed deep learning (DDL) has recently been proposed to accelerate the training
process of a deep learning model. The core idea is to have multiple workers collaboratively …
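The in-network aggregation idea this entry refers to can be sketched with a toy model: a programmable switch sums gradient vectors from several workers and returns a single aggregated vector, so each worker's traffic stays independent of the number of peers. This is purely illustrative (class and method names are invented here); the paper's actual multi-switch cooperative protocol is not shown.

```python
class AggregatingSwitch:
    """Toy in-network aggregator: accumulates gradients from workers."""

    def __init__(self, num_workers, grad_len):
        self.num_workers = num_workers
        self.buffer = [0.0] * grad_len  # running element-wise sum
        self.received = 0

    def push(self, gradient):
        """A worker pushes its local gradient; the switch accumulates it.
        Returns the aggregate once all workers have reported, else None."""
        for i, g in enumerate(gradient):
            self.buffer[i] += g
        self.received += 1
        if self.received == self.num_workers:
            return list(self.buffer)
        return None
```

With three workers, the first two `push` calls return `None` and the third returns the element-wise sum of all three gradients.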

Job placement strategy with opportunistic resource sharing for distributed deep learning clusters

H Li, T Sun, X Li, H Xu - … Conference on Smart City; IEEE 6th …, 2020 - ieeexplore.ieee.org
Distributed deep learning frameworks train large deep learning workloads with multiple
training jobs on shared distributed GPU servers. There are new challenges when …

GR-ADMM: A communication-efficient algorithm based on ADMM

X Huang, G Wang, Y Lei - 2021 IEEE Intl Conf on Parallel & …, 2021 - ieeexplore.ieee.org
In recent work, decentralized algorithms have received more attention. In a centralized
network, the worker nodes need to communicate with the central node, which results in the …
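The decentralized communication pattern this abstract contrasts with centralized training can be illustrated by gossip averaging over a ring topology, where each node exchanges state only with its neighbours and no central node is needed. This is a generic sketch of decentralized averaging, not the GR-ADMM update itself; all names here are illustrative.

```python
def gossip_round(values):
    """One synchronous round of ring gossip: each node replaces its value
    with the average of itself and its two ring neighbours."""
    n = len(values)
    return [(values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3
            for i in range(n)]

vals = [0.0, 4.0, 8.0]
for _ in range(50):
    vals = gossip_round(vals)
# repeated rounds drive every node toward the global mean 4.0
```

Each round moves only neighbour-to-neighbour traffic, which is the communication saving relative to every worker talking to one central node.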

Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads

CY Ho - 2023 - repository.kaust.edu.sa
Deep Neural Networks (DNNs) find widespread applications across various domains,
including computer vision, recommendation systems, and natural language processing …

MXDAG: A Hybrid Abstraction for Emerging Applications

W Wang, S Das, XC Wu, Z Wang, A Chen… - Proceedings of the 20th …, 2021 - dl.acm.org
Emerging distributed applications, such as microservices, machine learning, and big data
analysis, consist of both compute and network tasks. DAG-based abstraction primarily …

Cloud collectives: Towards cloud-aware collectives for ML workloads with rank reordering

L Luo, J Nelson, A Krishnamurthy, L Ceze - arXiv preprint arXiv …, 2021 - arxiv.org
ML workloads are becoming increasingly popular in the cloud. Good cloud training
performance is contingent on efficient parameter exchange among VMs. We find that …
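Rank reordering, as named in this title, can be sketched as placing ring neighbours on VMs with low pairwise latency. The greedy nearest-neighbour heuristic and cost model below are assumptions for illustration, not the paper's method.

```python
def reorder_ranks(latency):
    """Greedy rank reordering for a ring collective.

    latency[i][j]: measured latency between VM i and VM j.
    Returns a ring order of VM ids, each chosen as the closest
    not-yet-placed VM to the previously placed one."""
    n = len(latency)
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda v: latency[last][v])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For four VMs split across two racks (intra-rack latency 1, inter-rack latency 10), the greedy order keeps rack-mates adjacent, so only two of the four ring hops cross racks instead of all four.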

Green, yellow, yield: end-host traffic scheduling for distributed deep learning with tensorlights

XS Huang, A Chen, TSE Ng - 2019 IEEE International Parallel …, 2019 - ieeexplore.ieee.org
The recent success of Deep Learning (DL) in a broad range of AI services has led to a
surging amount of DL workloads in production clusters. To support DL jobs at scale, the …

Task merging and scheduling for parallel deep learning applications in mobile edge computing

X Long, J Wu, Y Wu, L Chen - 2019 20th International …, 2019 - ieeexplore.ieee.org
Mobile edge computing enables the execution of compute-intensive applications, e.g., deep
learning applications, on end devices with limited computation resources. However, the …

Scalable in-network computation for massively-parallel shared-memory processors

B Klenk, N Jiang, LR Dennison… - US Patent 11,171,798, 2021 - Google Patents
A network device configured to perform scalable, in-network computations is described. The
network device is configured to process pull requests and/or push requests from a plurality …

Scaling Machine Learning with a Ring-based Distributed Framework

K Zhao, Y Leng, H Zhang - Proceedings of the 2023 7th International …, 2023 - dl.acm.org
In centralized distributed machine learning systems, communication overhead between
servers and computing nodes has always been an important issue affecting the training …
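The ring-based communication pattern this entry alludes to is commonly realized as ring all-reduce: a reduce-scatter phase followed by an all-gather phase, with each node talking only to its ring successor. The simulation below is a minimal sketch under that assumption and is not taken from the paper.

```python
def ring_allreduce(node_chunks):
    """Simulate ring all-reduce by summing chunks across nodes.

    node_chunks: list of N lists, each holding N numeric chunks (one per node).
    Returns the fully summed chunks, replicated on every node."""
    n = len(node_chunks)
    data = [list(c) for c in node_chunks]
    # Reduce-scatter: after n-1 steps, node i owns the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, idx, val in sends:
            data[(src + 1) % n][idx] += val  # successor accumulates the chunk
    # All-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, idx, val in sends:
            data[(src + 1) % n][idx] = val  # successor copies the finished chunk
    return data
```

Because every node sends one chunk per step, per-node traffic stays near-constant as nodes are added, which is why this pattern avoids the server-side bottleneck of centralized parameter exchange.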