A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

FFCV: Accelerating training by removing data bottlenecks

G Leclerc, A Ilyas, L Engstrom… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present FFCV, a library for easy, fast, resource-efficient training of machine learning
models. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from …
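For flavor, a minimal loading sketch based on FFCV's documented usage; the file 'train.beton' is a placeholder for a dataset pre-encoded with ffcv.writer.DatasetWriter, and the pipeline contents are illustrative, not the paper's exact configuration:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor, ToTorchImage

# Loader replaces a PyTorch DataLoader, reading from FFCV's
# pre-encoded .beton format to avoid per-sample decode overhead.
loader = Loader(
    'train.beton',                # placeholder path to a pre-written dataset
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,     # shuffled traversal each epoch
    pipelines={
        'image': [SimpleRGBImageDecoder(), ToTensor(), ToTorchImage()],
        'label': [IntDecoder(), ToTensor()],
    },
)

for images, labels in loader:
    pass  # training step goes here
```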

Looking beyond GPUs for DNN scheduling on Multi-Tenant clusters

J Mohan, A Phanishayee, J Kulkarni… - … USENIX Symposium on …, 2022 - usenix.org
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

Multi-resource interleaving for deep learning training

Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin - Proceedings of the ACM …, 2022 - dl.acm.org
Training Deep Learning (DL) models requires multiple resource types, including CPUs,
GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …

tf.data: A machine learning data processing framework

DG Murray, J Simsa, A Klimovic, I Indyk - arXiv preprint arXiv:2101.12127, 2021 - arxiv.org
Training machine learning models requires feeding input data for models to ingest. Input
pipelines for machine learning jobs are often challenging to implement efficiently as they …
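A minimal sketch of the kind of pipeline tf.data is built for, using the public tf.data API with toy in-memory data; the parallelism and buffer values are illustrative:

```python
import tensorflow as tf

# Toy dataset: random images and integer labels standing in for real inputs.
images = tf.random.uniform([1024, 32, 32, 3])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)

ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .shuffle(1024)
      .map(lambda x, y: (tf.image.random_flip_left_right(x), y),
           num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
      .batch(256)
      .prefetch(tf.data.AUTOTUNE))               # overlap input prep with training

for batch_images, batch_labels in ds.take(1):
    print(batch_images.shape, batch_labels.shape)  # (256, 32, 32, 3) (256,)
```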

Cachew: Machine learning input data processing as a service

D Graur, D Aymon, D Kluser, T Albrici… - 2022 USENIX Annual …, 2022 - usenix.org
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and
cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with …
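Cachew extends the tf.data service, which disaggregates input processing from training. A generic client-side sketch of that substrate (not Cachew's own interface); the dispatcher address is a placeholder for a deployed service:

```python
import tensorflow as tf

# Offload the map stage to remote tf.data service workers,
# feeding the training host over gRPC.
ds = tf.data.Dataset.range(10_000)
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.apply(tf.data.experimental.service.distribute(
    processing_mode="distributed_epoch",
    service="grpc://dispatcher.example:5000"))  # placeholder address
ds = ds.prefetch(tf.data.AUTOTUNE)
```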

Characterizing the performance of accelerated Jetson edge devices for training deep learning models

P SK, SA Kesanapalli, Y Simmhan - … of the ACM on Measurement and …, 2022 - dl.acm.org
Deep Neural Networks (DNNs) have had a significant impact on domains like autonomous
vehicles and smart cities through low-latency inferencing on edge computing devices close …

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications

F Strati, X Ma, A Klimovic - … of the Nineteenth European Conference on …, 2024 - dl.acm.org
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …