A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …

Scaling distributed machine learning with in-network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

ATP: In-network aggregation for multi-tenant learning

CL Lao, Y Le, K Mahajan, Y Chen, W Wu… - … USENIX Symposium on …, 2021 - usenix.org
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …

In-network aggregation for data center networks: A survey

A Feng, D Dong, F Lei, J Ma, E Yu, R Wang - Computer Communications, 2023 - Elsevier
Aggregation applications are widely deployed in data centers, such as distributed machine
learning and MapReduce-like frameworks. These applications typically have large …

Distributed hierarchical GPU parameter server for massive scale deep learning ads systems

W Zhao, D Xie, R Jia, Y Qian, R Ding… - … of Machine Learning …, 2020 - proceedings.mlsys.org
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad
relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot …

GRACE: A compressed communication framework for distributed machine learning

H Xu, CY Ho, AM Abdelmoniem, A Dutta… - 2021 IEEE 41st …, 2021 - ieeexplore.ieee.org
Powerful computer clusters are used nowadays to train complex deep neural networks
(DNNs) on large datasets. Distributed training is increasingly communication bound …

Priority-based parameter propagation for distributed DNN training

A Jayarajan, J Wei, G Gibson… - Proceedings of …, 2019 - proceedings.mlsys.org
Data parallel training is widely used for scaling distributed deep neural network (DNN)
training. However, the performance benefits are often limited by the communication-heavy …

Efficient sparse collective communication and its application to accelerate distributed deep learning

J Fei, CY Ho, AN Sahu, M Canini, A Sapio - Proceedings of the 2021 …, 2021 - dl.acm.org
Efficient collective communication is crucial to parallel-computing applications such as
distributed training of large-scale recommendation systems and natural language …

Accelerating decentralized federated learning in heterogeneous edge computing

L Wang, Y Xu, H Xu, M Chen… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
In edge computing (EC), federated learning (FL) enables massive numbers of devices to collaboratively
train AI models without exposing local data. In order to avoid the possible bottleneck of the …