DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Y Kim, K Kim, Y Cho, J Kim, A Khan, KD Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as
the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However …

Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training

Q Anthony, D Dai - 2021 SC Workshops Supplementary …, 2021 - ieeexplore.ieee.org
Deep learning (DL) applications are becoming one of the most important applications for
HPC and cloud systems. The massive datasets and deep neural networks (DNN) used by …

Deepfreeze: Towards scalable asynchronous checkpointing of deep learning models

B Nicolae, J Li, JM Wozniak, G Bosilca… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
In the age of big data, deep learning has emerged as a powerful tool to extract insight and
exploit its value, both in industry and scientific applications. One common pattern emerging …

A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters

S Oh, K Kim, E Seo - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
The amount of available resources of a cloud is constantly changing. However, the current
distributed DNN framework does not allow dynamic scaling of a training cluster. Therefore, a …

Stash: A comprehensive stall-centric characterization of public cloud VMs for distributed deep learning

A Sharma, VM Bhasi, S Singh, R Jain… - 2023 IEEE 43rd …, 2023 - ieeexplore.ieee.org
Deep neural networks (DNNs) are increasingly popular owing to their ability to solve
complex problems such as image recognition, autonomous driving, and natural language …

spotDNN: Provisioning Spot Instances for Predictable Distributed DNN Training in the Cloud

R Shang, F Xu, Z Bai, L Chen… - 2023 IEEE/ACM 31st …, 2023 - ieeexplore.ieee.org
Distributed Deep Neural Network (DDNN) training on cloud spot instances is increasingly
compelling as it can significantly save the user budget. To handle unexpected instance …

Srifty: Swift and thrifty distributed neural network training on the cloud

L Luo, P West, P Patel… - … of Machine Learning …, 2022 - proceedings.mlsys.org
Finding the best VM configuration is key to achieve lower cost and higher throughput, two
primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM …

How can we train deep learning models across clouds and continents? an experimental study

A Erben, R Mayer, HA Jacobsen - arXiv preprint arXiv:2306.03163, 2023 - arxiv.org
This paper aims to answer the question: Can deep learning models be cost-efficiently
trained on a global market of spot VMs spanning different data centers and cloud providers …

Multi-tier GPU virtualization for deep learning in cloud-edge systems

J Kennedy, V Sharma, B Varghese… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Accelerator virtualization offers several advantages in the context of cloud-edge computing.
Relatively weak user devices can enhance performance when running workloads by …

: cloud-based cluster provisioning for distributed machine learning

NBD Ta - Cluster Computing, 2019 - Springer
Training large, complex machine learning models such as deep neural networks with big
data requires powerful computing clusters, which are costly to acquire, use and maintain. As …