Distributed Deep Learning in An Edge Computing System

T Sen, H Shen, Z Mehrab - … Conference on Mobile Ad Hoc and …, 2022 - ieeexplore.ieee.org
In many scenarios (e.g., hurricanes, earthquakes, rural areas), edge devices cannot access
the cloud, which makes the cloud deep learning (DL) training approach inapplicable …

DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

Q Wang, B Sang, H Zhang, M Tang, K Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The cloud is still a popular platform for distributed deep learning (DL) training jobs since
resource sharing in the cloud can improve resource utilization and reduce overall costs …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …

Distributed training for deep learning models on an edge computing network using shielded reinforcement learning

T Sen, H Shen - 2022 IEEE 42nd International Conference on …, 2022 - ieeexplore.ieee.org
With the emergence of edge devices along with their local computation advantage over the
cloud, distributed deep learning (DL) training on edge nodes becomes promising. In such a …

EdgeTuner: Fast scheduling algorithm tuning for dynamic edge-cloud workloads and resources

R Han, S Wen, CH Liu, Y Yuan… - IEEE INFOCOM 2022 …, 2022 - ieeexplore.ieee.org
Edge-cloud jobs are rapidly becoming prevalent in many application domains, posing the challenge of
using both resource-strenuous edge devices and elastic cloud resources. Efficient resource …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

Chronica: A Data-Imbalance-Aware Scheduler for Distributed Deep Learning

S Maeng, GE Moon, S Park - 2023 IEEE/ACM 23rd …, 2023 - ieeexplore.ieee.org
One of the major challenges in distributed deep learning is attenuating the straggler problem.
Stragglers increase synchronization latency and significantly inhibit the convergence of …

Enabling DNN acceleration with data and model parallelization over ubiquitous end devices

Y Huang, X Qiao, W Lai, S Dustdar… - IEEE Internet of Things …, 2021 - ieeexplore.ieee.org
Deep neural networks (DNNs) show great promise in providing more intelligence to
ubiquitous end devices. However, the existing partition-offloading schemes adopt data …

Selective Preemption of Distributed Deep Learning Training

Y Go, C Shin, J Lee, Y Yoo, G Yang… - 2023 IEEE 16th …, 2023 - ieeexplore.ieee.org
As more distributed deep learning (DDL) jobs run in public clouds, their effective scheduling
becomes a major challenge. Current studies prioritize the execution of jobs with less …

Joint job offloading and resource allocation for distributed deep learning in edge computing

H Wang, X Chen, H Xu, J Liu… - 2019 IEEE 21st …, 2019 - ieeexplore.ieee.org
Under the paradigm of Edge Computing, the enormous amount of data generated at the network edge
can be processed locally. Machine learning methods are often adopted to make full …