Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With sustained technological advances in machine learning (ML) and the recent availability of
massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …

A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Heterogeneity-aware cluster scheduling policies for deep learning workloads

D Narayanan, K Santhanam, F Kazhamiaka… - … USENIX Symposium on …, 2020 - usenix.org
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been
increasingly deployed to train deep learning models. These accelerators exhibit …

Scaling distributed machine learning with in-network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …

Making ai less" thirsty": Uncovering and addressing the secret water footprint of ai models

P Li, J Yang, MA Islam, S Ren - arXiv preprint arXiv:2304.03271, 2023 - arxiv.org
The growing carbon footprint of artificial intelligence (AI) models, especially large ones such
as GPT-3, has been undergoing public scrutiny. Unfortunately, however, the equally …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Analysis of large-scale multi-tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, many large enterprises are
beginning to incorporate machine learning models across a number of products. These …