Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains, such as image recognition and natural language processing …

Deep neural networks in the cloud: Review, applications, challenges and research directions

KY Chan, B Abu-Salih, R Qaddoura, AZ Ala'M… - Neurocomputing, 2023 - Elsevier
Deep neural networks (DNNs) are currently being deployed as machine learning technology
in a wide range of important real-world applications. DNNs consist of a huge number of …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …

Bamboo: Making preemptible instances resilient for affordable training of large DNNs

J Thorpe, P Zhao, J Eyolfson, Y Qiao, Z Jia… - … USENIX Symposium on …, 2023 - usenix.org
DNN models across many domains continue to grow in size, resulting in high resource
requirements for effective training, and unpalatable (and often unaffordable) costs for …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …

DL2: A deep learning-driven scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Efficient resource scheduling is essential for maximal utilization of expensive deep learning
(DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …

Fleet: Online federated learning via staleness awareness and performance prediction

G Damaskinos, R Guerraoui, AM Kermarrec… - ACM Transactions on …, 2022 - dl.acm.org
Federated learning (FL) is very appealing for its privacy benefits: essentially, a global model
is trained with updates computed on mobile devices while keeping the data of users local …

Distributed deep learning on data systems: A comparative analysis of approaches

Y Zhang, F Mcquillan, N Jayaram, N Kak… - Proceedings of the …, 2021 - par.nsf.gov
Deep learning (DL) is growing in popularity for many data analytics applications, including
among enterprises. Large business-critical datasets in such settings typically reside in …

Crossbow: Scaling deep learning with small batch sizes on multi-GPU servers

A Koliousis, P Watcharapichat, M Weidlich… - arXiv preprint arXiv …, 2019 - arxiv.org
Deep learning models are trained on servers with many GPUs, and training must scale with
the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel …

Resource elasticity in distributed deep learning

A Or, H Zhang, M Freedman - Proceedings of Machine …, 2020 - proceedings.mlsys.org
Elasticity—scaling out or in depending upon resource demand or availability—allows a
system to improve its efficiency or performance. This leads to potentially significant cost …