F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload …
W Li, G Yuan, Z Wang, G Tan, P Zhang… - Journal of Optical …, 2024 - opg.optica.org
With the ever-increasing size of training models and datasets, network communication has emerged as a major bottleneck in distributed deep learning training. To address this …
Y Liu, J Zhang, S Liu, Q Wang, W Dai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, only about half of the maximum …
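For context on the algorithm this entry refers to, below is a minimal single-process sketch of ring all-reduce (the scatter-reduce phase followed by the all-gather phase). It is an illustrative NumPy simulation under assumed worker counts and chunk sizes, not the cited paper's implementation or a real network transport.

```python
# Simulate ring all-reduce for P "workers", each holding an equally sized
# gradient vector. Real systems (e.g. NCCL, Horovod) run the same schedule
# over network links; here everything happens in one process.
import numpy as np

def ring_allreduce(grads):
    """Return, for every worker, the elementwise sum of all workers' gradients."""
    p = len(grads)
    # Each worker splits its gradient into p chunks.
    chunks = [np.array_split(g.astype(float), p) for g in grads]

    # Phase 1: scatter-reduce. After p-1 steps, worker w holds the fully
    # reduced chunk (w + 1) % p.
    for step in range(p - 1):
        # Snapshot outgoing chunks so in-step updates model simultaneous sends.
        sends = [chunks[w][(w - step) % p].copy() for w in range(p)]
        for w in range(p):
            dst, idx = (w + 1) % p, (w - step) % p
            chunks[dst][idx] = chunks[dst][idx] + sends[w]

    # Phase 2: all-gather. Each fully reduced chunk circulates around the ring
    # until every worker holds every reduced chunk.
    for step in range(p - 1):
        sends = [chunks[w][(w + 1 - step) % p].copy() for w in range(p)]
        for w in range(p):
            dst, idx = (w + 1) % p, (w + 1 - step) % p
            chunks[dst][idx] = sends[w]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]
    reduced = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in reduced)
    print("all 4 workers hold the summed gradient")
```

Each worker sends and receives only 2(p-1)/p of the data per worker, which is why the ring schedule is bandwidth-optimal in the ideal case; the snippet above abstracts away the link-level effects the cited work analyzes.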
The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the computation and memory requirements of DNN …
Over the last few decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, the …
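As a reminder of the method this entry discusses, here is a minimal sketch of mini-batch SGD on a least-squares objective. The synthetic data, batch size, and learning rate are illustrative assumptions, not values from the cited work.

```python
# Mini-batch SGD on 0.5 * ||Xw - y||^2, updating w <- w - lr * grad each batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)
lr, batch = 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        # Gradient of the mean squared error on this mini-batch.
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad  # stochastic gradient step

print(np.round(w, 2))  # should be close to w_true
```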
Artificial intelligence (AI) research and the AI market have grown rapidly in the last few years, and this trend is expected to continue with many potential advancements and innovations in this …
This book provides a thorough explanation of the path to using cloud computing technologies to run High-Performance Computing (HPC) applications. Besides presenting …
Z Zhang, C Wang - IEEE Transactions on Parallel and …, 2021 - ieeexplore.ieee.org
Parameter server architecture has been identified as an efficient framework for scaling DNN training on clusters. For large-scale deployment, communication becomes the …
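To illustrate the architecture this entry refers to, here is a minimal single-process sketch of the parameter-server pattern: workers pull the current weights, compute gradients on their data shard, and push them back for the server to apply. The class and method names (ParameterServer, pull, push) and the least-squares example are illustrative assumptions, not an actual framework API.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current global weights.
        return self.w.copy()

    def push(self, grad):
        # Synchronous update; real deployments often shard keys across several
        # server nodes and may apply updates asynchronously.
        self.w -= self.lr * grad

def worker_grad(w, X, y):
    # Least-squares gradient on this worker's data shard.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 3))
y = X @ np.array([2.0, -1.0, 0.5])
shards = np.array_split(np.arange(400), 4)  # 4 workers

ps = ParameterServer(dim=3)
for step in range(200):
    grads = [worker_grad(ps.pull(), X[s], y[s]) for s in shards]
    ps.push(np.mean(grads, axis=0))  # aggregate worker gradients and apply

print(np.round(ps.w, 2))  # converges toward [2.0, -1.0, 0.5]
```

In a real cluster the pull/push calls become network RPCs, which is exactly where the communication cost discussed in this entry arises.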
L Remis, CW Lacewell - Proceedings of the VLDB Endowment, 2021 - dl.acm.org
Data scientists spend most of their time dealing with data preparation, rather than doing what they know best: building machine learning models and algorithms to solve previously …