A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

FFCV: Accelerating training by removing data bottlenecks

G Leclerc, A Ilyas, L Engstrom… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present FFCV, a library for easy, fast, resource-efficient training of machine learning
models. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from …
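For flavor, a minimal loading sketch based on FFCV's documented usage; the file 'train.beton' is a placeholder for a dataset pre-encoded with ffcv.writer.DatasetWriter, and the pipeline contents are illustrative, not the paper's exact configuration:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor, ToTorchImage

# Loader replaces a PyTorch DataLoader, reading from FFCV's
# pre-encoded .beton format to avoid per-sample decode overhead.
loader = Loader(
    'train.beton',                # placeholder path to a pre-written dataset
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,     # shuffled traversal each epoch
    pipelines={
        'image': [SimpleRGBImageDecoder(), ToTensor(), ToTorchImage()],
        'label': [IntDecoder(), ToTensor()],
    },
)

for images, labels in loader:
    pass  # training step goes here
```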

Looking beyond GPUs for DNN scheduling on Multi-Tenant clusters

J Mohan, A Phanishayee, J Kulkarni… - … USENIX Symposium on …, 2022 - usenix.org
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

Multi-resource interleaving for deep learning training

Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin - Proceedings of the ACM …, 2022 - dl.acm.org
Training Deep Learning (DL) models requires multiple resource types, including CPUs,
GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …

tf.data: A machine learning data processing framework

DG Murray, J Simsa, A Klimovic, I Indyk - arXiv preprint arXiv:2101.12127, 2021 - arxiv.org
Training machine learning models requires feeding input data for models to ingest. Input
pipelines for machine learning jobs are often challenging to implement efficiently as they …
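A minimal sketch of the kind of pipeline tf.data is built for, using the public tf.data API with toy in-memory data; the parallelism and buffer values are illustrative:

```python
import tensorflow as tf

# Toy dataset: random images and integer labels standing in for real inputs.
images = tf.random.uniform([1024, 32, 32, 3])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)

ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .shuffle(1024)
      .map(lambda x, y: (tf.image.random_flip_left_right(x), y),
           num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
      .batch(256)
      .prefetch(tf.data.AUTOTUNE))               # overlap input prep with training

for batch_images, batch_labels in ds.take(1):
    print(batch_images.shape, batch_labels.shape)  # (256, 32, 32, 3) (256,)
```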

Cachew: Machine learning input data processing as a service

D Graur, D Aymon, D Kluser, T Albrici… - 2022 USENIX Annual …, 2022 - usenix.org
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and
cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with …
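Cachew extends the tf.data service, which disaggregates input processing from training. A generic client-side sketch of that substrate (not Cachew's own interface); the dispatcher address is a placeholder for a deployed service:

```python
import tensorflow as tf

# Offload the map stage to remote tf.data service workers,
# feeding the training host over gRPC.
ds = tf.data.Dataset.range(10_000)
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.apply(tf.data.experimental.service.distribute(
    processing_mode="distributed_epoch",
    service="grpc://dispatcher.example:5000"))  # placeholder address
ds = ds.prefetch(tf.data.AUTOTUNE)
```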

Characterizing the performance of accelerated Jetson edge devices for training deep learning models

P SK, SA Kesanapalli, Y Simmhan - … of the ACM on Measurement and …, 2022 - dl.acm.org
Deep Neural Networks (DNNs) have had a significant impact on domains like autonomous
vehicles and smart cities through low-latency inferencing on edge computing devices close …

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications

F Strati, X Ma, A Klimovic - … of the Nineteenth European Conference on …, 2024 - dl.acm.org
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …