Elan: Towards generic and efficient elastic training for deep learning

L Xie, J Zhai, B Wu, Y Wang, X Zhang… - 2020 IEEE 40th …, 2020 - ieeexplore.ieee.org
Elastic deep learning training, which promises to improve resource utilization and accelerate
training, has been attracting increasing attention recently …
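
For orientation only (this is not Elan's specific mechanism), the core of any elastic trainer is a job that can be restarted on a changed set of workers and resume from a checkpoint. A minimal sketch using PyTorch's torchrun elastic launcher, where the model, checkpoint path, and epoch count are illustrative:

    # Launch (real torchrun flags, example values):
    #   torchrun --nnodes=1:4 --nproc_per_node=2 --max_restarts=3 \
    #            --rdzv_backend=c10d --rdzv_endpoint=HOST:29400 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="gloo")  # re-run on every elastic restart
        model = DDP(torch.nn.Linear(10, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        ckpt, start = "ckpt.pt", 0  # hypothetical shared checkpoint path
        if os.path.exists(ckpt):    # resume after a rescale or restart
            state = torch.load(ckpt)
            model.load_state_dict(state["model"])
            start = state["epoch"] + 1
        for epoch in range(start, 5):
            x, y = torch.randn(32, 10), torch.randn(32, 1)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            if dist.get_rank() == 0:
                torch.save({"model": model.state_dict(), "epoch": epoch}, ckpt)

    if __name__ == "__main__":
        main()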

Adaptive precision training for resource-constrained devices

T Huang, T Luo, JT Zhou - 2020 IEEE 40th International …, 2020 - ieeexplore.ieee.org
In-situ learning is a growing trend in Edge AI. Training deep neural networks (DNNs) on edge
devices is challenging because both energy and memory are constrained. Low precision …
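
The snippet cuts off at "Low precision"; as generic background on low-precision training (not this paper's adaptive scheme), a single mixed-precision training step with PyTorch's AMP API:

    import torch
    from torch.cuda.amp import autocast, GradScaler

    model = torch.nn.Linear(128, 10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow

    x = torch.randn(64, 128, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")

    with autocast():       # eligible ops run in float16
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)       # unscales gradients; skips the step on inf/nan
    scaler.update()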

EDDL: A distributed deep learning system for resource-limited edge computing environment

P Hao, Y Zhang - 2021 IEEE/ACM Symposium on Edge …, 2021 - ieeexplore.ieee.org
This paper investigates the problem of performing distributed deep learning (DDL) to train
machine learning (ML) models at the edge with resource-constrained embedded devices …

EasyScale: Accuracy-consistent elastic training for deep learning

M Li, W Xiao, B Sun, H Zhao, H Yang, S Ren… - arXiv preprint arXiv …, 2022 - arxiv.org
Distributed synchronous GPU training is commonly used for deep learning. The constraint of
using a fixed set of GPUs makes large-scale deep learning training jobs suffer, and …

An optimal resource allocator of elastic training for deep learning jobs on cloud

L Hu, J Zhu, Z Zhou, R Cheng, X Bai… - arXiv preprint arXiv …, 2021 - arxiv.org
Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users
with computational resources to train their deep learning jobs. Elastic training is a service …

DyVEDeep: Dynamic Variable Effort Deep Neural Networks

S Ganapathy, S Venkataramani, G Sriraman… - ACM Transactions on …, 2020 - dl.acm.org
Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine
learning tasks and are deployed in increasing numbers of products and services. However …
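
The truncated snippet does not reach DyVEDeep's actual mechanisms; as a sketch of the same dynamic-effort idea via a different, well-known technique (early exit), with illustrative layer sizes and confidence threshold:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EarlyExitNet(nn.Module):
        """Spend less compute on easy inputs by exiting at a cheap head."""
        def __init__(self, threshold=0.9):
            super().__init__()
            self.stage1 = nn.Linear(32, 64)
            self.head1 = nn.Linear(64, 10)   # cheap early-exit classifier
            self.stage2 = nn.Linear(64, 64)
            self.head2 = nn.Linear(64, 10)   # full-effort classifier
            self.threshold = threshold

        def forward(self, x):
            h = F.relu(self.stage1(x))
            p = F.softmax(self.head1(h), dim=-1)
            if p.max() >= self.threshold:    # confident: skip stage 2
                return p
            return F.softmax(self.head2(F.relu(self.stage2(h))), dim=-1)

    net = EarlyExitNet()
    print(net(torch.randn(1, 32)))  # one input, so p.max() is a scalar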

Accelerating data loading in deep neural network training

CC Yang, G Cong - 2019 IEEE 26th International Conference …, 2019 - ieeexplore.ieee.org
Data loading can dominate deep neural network training time on large-scale systems. We
present a comprehensive study on accelerating data loading performance in large-scale …
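
As context for the kinds of knobs such a study measures (values here are illustrative, not the paper's), a PyTorch DataLoader configured to overlap data loading with computation:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    ds = TensorDataset(torch.randn(10000, 3, 32, 32),
                       torch.randint(0, 10, (10000,)))
    loader = DataLoader(
        ds,
        batch_size=256,
        num_workers=4,            # worker processes decode/augment in parallel
        pin_memory=True,          # page-locked memory enables async host-to-GPU copies
        prefetch_factor=2,        # batches each worker prepares ahead of time
        persistent_workers=True,  # keep workers alive across epochs
    )
    for x, y in loader:
        pass  # the training step consumes the prefetched batch here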

Enabling compute-communication overlap in distributed deep learning training platforms

S Rashidi, M Denton, S Sridharan… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with hundreds of gigabytes per second (GB/s) of bandwidth …
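
For background (not this paper's platform model), data-parallel frameworks already exploit this overlap: PyTorch's DistributedDataParallel buckets gradients and launches each bucket's all-reduce asynchronously while the backward pass continues. A minimal sketch, run under torchrun:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="gloo")  # gloo so the sketch runs on CPU
    model = torch.nn.Sequential(torch.nn.Linear(512, 512),
                                torch.nn.Linear(512, 10))
    # bucket_cap_mb controls the gradient bucket size (25 MB is the default);
    # smaller buckets start communicating earlier, larger ones amortize launch cost.
    model = DDP(model, bucket_cap_mb=25)

    x = torch.randn(32, 512)
    model(x).sum().backward()  # all-reduces overlap with remaining backward work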

BePOCH: Improving federated learning performance in resource-constrained computing devices

L Ibraimi, M Selimi, F Freitag - 2021 IEEE Global …, 2021 - ieeexplore.ieee.org
Inference with trained machine learning models is now possible on small computing
devices, whereas only a few years ago it ran mostly in the cloud. The recent technique …
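
The snippet is cut off before BePOCH's contribution; purely for orientation, the federated-averaging (FedAvg) baseline that such work builds on, with the per-round number of local epochs exposed as the kind of knob work like BePOCH targets (all names here are illustrative):

    import copy
    import torch
    import torch.nn.functional as F

    def fedavg_round(global_model, client_loaders, local_epochs=1, lr=0.01):
        # Each client trains a copy of the global model on its own data;
        # the server then averages the resulting weights.
        states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for _ in range(local_epochs):  # the epoch budget per round
                for x, y in loader:
                    loss = F.cross_entropy(local(x), y)
                    opt.zero_grad(); loss.backward(); opt.step()
            states.append(local.state_dict())
        avg = {k: torch.stack([s[k].float() for s in states]).mean(0)
               for k in states[0]}
        global_model.load_state_dict(avg)
        return global_model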

Speeding up deep learning with transient servers

S Li, RJ Walls, L Xu, T Guo - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce
the training time of deep learning models by using a cluster of GPU servers. While such …