PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning

K Assogba, E Lima, MM Rafique… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Accurately predicting the training time of deep learning (DL) workloads is critical for
optimizing the utilization of data centers and allocating the required cluster resources for …

Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning

M Ritter, F Wolf - Proceedings of the SC'23 Workshops of The …, 2023 - dl.acm.org
With the rapidly increasing size and complexity of DNNs, equally sophisticated methods are
needed to train them efficiently, including distributed training and various model/hybrid …

Prediction of the resource consumption of distributed deep learning systems

G Yang, C Shin, J Lee, Y Yoo, C Yoo - … of the ACM on Measurement and …, 2022 - dl.acm.org
The prediction of the resource consumption for the distributed training of deep learning
models is of paramount importance, as it can inform a priori users how long their training …

Performance modeling and scalability optimization of distributed deep learning systems

F Yan, O Ruwase, Y He, T Chilimbi - Proceedings of the 21th ACM …, 2015 - dl.acm.org
Big deep neural network (DNN) models trained on large amounts of data have recently
achieved the best accuracy on hard tasks, such as image and speech recognition. Training …

DLTAP: A network-efficient scheduling method for distributed deep learning workload in containerized cluster environment

W Qiao, Y Li, ZH Wu - ITM Web of Conferences, 2017 - itm-conferences.org
Deep neural networks (DNNs) have recently yielded strong results on a range of
applications. Training these DNNs using a cluster of commodity machines is a promising …

Performance Models for Distributed Deep Learning Training Jobs on Ray

F Filippini, B Lublinsky, M de Bayser… - 2023 49th Euromicro …, 2023 - ieeexplore.ieee.org
Deep Learning applications are pervasive today, and efficient strategies are designed to
reduce the computational time and resource demand of the training process. The Distributed …

Cynthia: Cost-efficient cloud resource provisioning for predictable distributed deep neural network training

H Zheng, F Xu, L Chen, Z Zhou, F Liu - Proceedings of the 48th …, 2019 - dl.acm.org
It becomes an increasingly popular trend for deep neural networks with large-scale datasets
to be trained in a distributed manner in the cloud. However, widely known as resource …

Multivariate LSTM for Execution Time Prediction in HPC for Distributed Deep Learning Training

T Assali, ZT Ayoub, S Ouni - 2024 IEEE 27th International …, 2024 - ieeexplore.ieee.org
In the last decade, distributed deep learning has been widely used and introduced in
research for highly computational tasks where time is very critical due to its capability to train …

PSeer: Performance Prediction for Partially Co-located Distributed Deep Learning

W Ding, Z Ding, L Zhao, T Qiu - 2021 IEEE 23rd Int Conf on …, 2021 - ieeexplore.ieee.org
This paper studies the problem of predicting the deep learning (DL) jobs' training time which
is the fundamental guide for training resources allocating and job scheduling. The existing …

Leveraging sparse auto-encoding and dynamic learning rate for efficient cloud workloads prediction

D Alqahtani - IEEE Access, 2023 - ieeexplore.ieee.org
Cloud computing provides simple on-demand access to a centralized shared pool of
computing resources. Performance and efficient utilization of cloud computing resources …