Cloud-edge computing-based ICICOS framework for industrial automation and artificial intelligence: a survey

W Su, G Xu, Z He, IK Machica, V Quimno… - Journal of Circuits …, 2023 - World Scientific
Industrial Automation (IA) and Artificial Intelligence (AI) need an integrated platform. Due to
the uncertainty of the time required for training or reasoning tasks, it is difficult to ensure the …

Online job scheduling for distributed machine learning in optical circuit switch networks

L Liu, H Yu, G Sun, H Zhou, Z Li, S Luo - Knowledge-Based Systems, 2020 - Elsevier
Networking has become a well-known performance bottleneck for distributed machine
learning (DML). Although lots of works have focused on accelerating the communication …

DeepReduce: A sparse-tensor communication framework for distributed deep learning

K Kostopoulou, H Xu, A Dutta, X Li, A Ntoulas… - arXiv preprint arXiv …, 2021 - arxiv.org
Sparse tensors appear frequently in distributed deep learning, either as a direct artifact of
the deep neural network's gradients, or as a result of an explicit sparsification process …
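The snippet notes that sparse tensors arise either directly from DNN gradients or from an explicit sparsification step. As a rough illustration only (not DeepReduce's actual algorithm), a minimal top-k gradient sparsification and reconstruction sketch in Python/NumPy might look like this:

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of a gradient tensor.
    Returns (indices, values), the compact representation a sparse-tensor
    communication layer would ship instead of the dense gradient."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # k largest magnitudes
    return idx, flat[idx]

def densify(indices, values, shape):
    """Rebuild a dense tensor from (indices, values) on the receiver side."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

if __name__ == "__main__":
    g = np.random.randn(4, 256).astype(np.float32)
    idx, vals = topk_sparsify(g, k=64)   # transmit ~6% of the entries
    g_hat = densify(idx, vals, g.shape)
    print("kept", len(idx), "of", g.size, "entries")
```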

PaddleBox: Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising

W Zhao, X Jiao, M Hu, X Li, X Zhang… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Click-through rate (CTR) prediction is one of the most crucial components in the online
advertising industry. In order to produce a personalized CTR prediction, an industry-level …

Preemptive switch memory usage to accelerate training jobs with shared in-network aggregation

H Wang, Y Qin, CL Lao, Y Le, W Wu… - 2023 IEEE 31st …, 2023 - ieeexplore.ieee.org
Recent works introduce In-Network Aggregation (INA) for distributed training (DT), which
moves the gradient summation into programmable network switches. INA can reduce the …
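To make the idea of moving gradient summation into the network concrete, here is a toy, hypothetical emulation of an INA switch with a fixed pool of aggregation slots (the class and method names are illustrative, not the paper's protocol):

```python
import numpy as np

class SwitchAggregator:
    """Toy INA switch: a fixed pool of slots, each summing one gradient
    chunk across all workers before broadcasting the result back."""

    def __init__(self, num_slots: int, chunk_size: int, num_workers: int):
        self.slots = np.zeros((num_slots, chunk_size), dtype=np.float32)
        self.counts = np.zeros(num_slots, dtype=np.int32)
        self.num_workers = num_workers

    def submit(self, slot: int, chunk: np.ndarray):
        """A worker adds its chunk into a slot; once every worker has
        contributed, the aggregated chunk is returned and the slot freed."""
        self.slots[slot] += chunk
        self.counts[slot] += 1
        if self.counts[slot] == self.num_workers:
            result = self.slots[slot].copy()
            self.slots[slot] = 0.0
            self.counts[slot] = 0
            return result      # aggregated gradient, sent back to workers
        return None            # still waiting on other workers

if __name__ == "__main__":
    agg = SwitchAggregator(num_slots=4, chunk_size=8, num_workers=2)
    assert agg.submit(0, np.ones(8)) is None
    print(agg.submit(0, 2 * np.ones(8)))  # [3. 3. ...] -- the in-network sum
```

The limited slot pool is what makes switch memory a scarce, schedulable resource, which is the setting these INA papers study.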

GOAT: Gradient scheduling with collaborative in-network aggregation for distributed training

J Fang, G Zhao, H Xu, Z Yu, B Shen… - 2023 IEEE/ACM 31st …, 2023 - ieeexplore.ieee.org
The surging scale of distributed training (DT) incurs significant communication overhead in
datacenters, and in-network aggregation (INA) is a promising solution. It leverages …

Following the correct direction: Renovating sparsified SGD towards global optimization in distributed edge learning

W Ning, H Sun, X Fu, X Yang, Q Qi… - IEEE Journal on …, 2021 - ieeexplore.ieee.org
Distributed edge learning coordinates powerful edge devices to collaboratively train a shared
global model. Since the frequent communication between the server and workers is very …
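Sparsified SGD only transmits a small fraction of each gradient, so a common way to keep the accumulated updates pointed in the direction of the true gradient is local error feedback: the discarded mass is stored and re-added in the next round. The following is an illustrative sketch of that generic idea, not necessarily this paper's specific correction scheme:

```python
import numpy as np

def sparsified_sgd_step(w, grad, residual, lr=0.1, k=10):
    """One worker-side step of top-k sparsified SGD with error feedback:
    gradient mass dropped by sparsification is kept in `residual` and
    re-added next round, so updates track the full-gradient direction."""
    corrected = grad + residual                   # add back previously dropped mass
    flat = corrected.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]                       # this is what gets communicated
    new_residual = (flat - sparse).reshape(grad.shape)
    w_new = w - lr * sparse.reshape(grad.shape)   # apply only the transmitted part
    return w_new, new_residual

if __name__ == "__main__":
    w = np.zeros(100)
    residual = np.zeros(100)
    for _ in range(5):
        grad = np.random.randn(100)
        w, residual = sparsified_sgd_step(w, grad, residual)
    print("residual norm:", np.linalg.norm(residual))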

Accelerating Distributed Deep Learning using Lossless Homomorphic Compression

H Li, Y Xu, J Chen, R Dwivedula, W Wu, K He… - arXiv preprint arXiv …, 2024 - arxiv.org
As deep neural networks (DNNs) grow in complexity and size, the resultant increase in
communication overhead during distributed training has become a significant bottleneck …

Research progress on network performance optimization for distributed machine learning systems

S Wang, D Li - Chinese Journal of Computers, 2022 - 159.226.43.17
Abstract: Artificial intelligence technologies represented by machine learning must process massive
amounts of data and place extremely high demands on the underlying computing power. Distributed
machine learning accelerates model training by deploying computation tasks across multiple compute nodes …
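The survey's setting is standard data-parallel training: each node computes a gradient on its own data shard, the gradients are averaged over the network (the step whose cost these optimizations target), and all nodes apply the same update. A minimal simulated sketch under those assumptions, with a least-squares loss and illustrative names:

```python
import numpy as np

def data_parallel_step(w, data_shards, lr=0.05):
    """One synchronous data-parallel SGD step across simulated nodes."""
    grads = []
    for X, y in data_shards:                    # one (X, y) shard per compute node
        residual = X @ w - y
        grads.append(X.T @ residual / len(y))   # local gradient on this shard
    g_avg = np.mean(grads, axis=0)              # all-reduce / parameter-server averaging
    return w - lr * g_avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)
    shards = []
    for _ in range(4):                          # 4 simulated nodes
        X = rng.normal(size=(64, 5))
        shards.append((X, X @ w_true))
    w = np.zeros(5)
    for _ in range(200):
        w = data_parallel_step(w, shards)
    print("error:", np.linalg.norm(w - w_true))
```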

Prophet: Speeding up distributed dnn training with predictable communication scheduling

Z Zhang, Q Qi, R Shang, L Chen, F Xu - Proceedings of the 50th …, 2021 - dl.acm.org
Optimizing performance for Distributed Deep Neural Network (DDNN) training has recently
become increasingly compelling, as the DNN model gets complex and the training dataset …