Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

Flowcon: Elastic flow configuration for containerized deep learning applications

W Zheng, M Tynes, H Gorelick, Y Mao… - Proceedings of the 48th …, 2019 - dl.acm.org
An increasing number of companies are using data analytics to improve their products,
services, and business processes. However, learning knowledge effectively from massive …

Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline

T Um, B Oh, B Seo, M Kweun, G Kim… - Proceedings of the VLDB …, 2023 - dl.acm.org
When training a deep learning (DL) model, input data are pre-processed on CPUs and
transformed into tensors, which are then fed into GPUs for gradient computations of model …

Elastic resource management for deep learning applications in a container cluster

Y Mao, V Sharma, W Zheng, L Cheng… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
The increasing demand for learning from massive datasets is restructuring our economy.
Effective learning, however, involves nontrivial computing resources. Most businesses utilize …

High performance I/O for large scale deep learning

A Aizman, G Maltby, T Breuel - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
Training deep learning (DL) models on petascale datasets is essential for achieving
competitive and state-of-the-art performance in applications such as speech, video analytics …

Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency

N Mungoli - arXiv preprint arXiv:2304.13738, 2023 - arxiv.org
In recent years, the integration of artificial intelligence (AI) and cloud computing has
emerged as a promising avenue for addressing the growing computational demands of AI …

Deep smart scheduling: A deep learning approach for automated big data scheduling over the cloud

G Rjoub, J Bentahar, OA Wahab… - 2019 7th International …, 2019 - ieeexplore.ieee.org
With the widespread adoption of Internet of Things (IoT) and the exponential growth in the
volumes of generated data, cloud providers tend to receive massive waves of demands on …

Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads

D Shukla, M Sivathanu, S Viswanatha… - arXiv preprint arXiv …, 2022 - arxiv.org
Lowering costs by driving high utilization across deep learning workloads is a crucial lever
for cloud providers. We present Singularity, Microsoft's globally distributed scheduling …

Multi-resource interleaving for deep learning training

Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin - Proceedings of the ACM …, 2022 - dl.acm.org
Training a Deep Learning (DL) model requires multiple resource types, including CPUs,
GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …

Deep and reinforcement learning for automated task scheduling in large‐scale cloud computing systems

G Rjoub, J Bentahar, O Abdel Wahab… - Concurrency and …, 2021 - Wiley Online Library
Cloud computing is undeniably becoming the main computing and storage platform for
today's major workloads. From Internet of things and Industry 4.0 workloads to big data …