Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With sustained technological advances in machine learning (ML) and the recent availability of
massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished across a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

DAPPLE: A pipelined data parallel approach for training large models

S Fan, Y Rong, C Meng, Z Cao, S Wang… - Proceedings of the 26th …, 2021 - dl.acm.org
It is challenging to train large DNN models on sophisticated GPU platforms with
diverse interconnect capabilities. Recently, pipelined training has been proposed as an …

Characterization and prediction of deep learning workloads in large-scale GPU datacenters

Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International …, 2021 - dl.acm.org
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …

Varuna: scalable, low-cost training of massive deep learning models

S Athlur, N Saran, M Sivathanu, R Ramjee… - Proceedings of the …, 2022 - dl.acm.org
Systems for training massive deep learning models (billions of parameters) today assume
and require specialized "hyperclusters": hundreds or thousands of GPUs wired with …

Horus: Interference-aware and prediction-based scheduling in deep learning systems

G Yeung, D Borowiec, R Yang, A Friday… - … on Parallel and …, 2021 - ieeexplore.ieee.org
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped
with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of …

Deep learning training in facebook data centers: Design of scale-up and scale-out systems

M Naumov, J Kim, D Mudigere, S Sridharan… - arXiv preprint arXiv …, 2020 - arxiv.org
Large-scale training is important to ensure high performance and accuracy of machine-
learning models. At Facebook we use many different models, including computer vision …

Towards GPU utilization prediction for cloud deep learning

G Yeung, D Borowiec, A Friday, R Harper… - 12th USENIX Workshop …, 2020 - usenix.org
Understanding the GPU utilization of Deep Learning (DL) workloads is important for
enhancing resource-efficiency and cost-benefit decision making for DL frameworks in the …

Performance prediction for convolutional neural networks on edge GPUs

H Bouzidi, H Ouarnoughi, S Niar… - Proceedings of the 18th …, 2021 - dl.acm.org
Edge computing is increasingly used for Artificial Intelligence (AI) purposes to meet latency,
privacy, and energy challenges. Convolutional Neural Networks (CNNs) are more frequently …