Exploring multi-dimensional hierarchical network topologies for efficient distributed training of trillion parameter DL models

W Won, S Rashidi, S Srinivasan, T Krishna - arXiv preprint arXiv …, 2021 - arxiv.org
Deep Neural Networks have attracted significant attention due to their wide applicability in
different domains. DNN sizes and training samples are constantly growing, making training …

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

S Rashidi, W Won, S Srinivasan, P Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Distributed Deep Neural Network (DNN) training is a technique to reduce the training
overhead by distributing the training tasks across multiple accelerators, according to a …
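As a rough illustration of the data-parallel pattern this snippet alludes to (a minimal sketch, not code from FRED; worker count, model size, and the gradient stand-in are assumptions), each accelerator holds a weight replica, computes gradients on its own shard of the batch, and the gradients are averaged before a shared update:

```python
# Minimal data-parallel training step (illustrative sketch only).
import numpy as np

num_workers = 4          # assumed number of accelerators
model_size = 8           # assumed number of parameters
weights = np.zeros(model_size)

def local_gradient(worker_id, weights):
    # Stand-in for a real backward pass on this worker's data shard.
    rng = np.random.default_rng(worker_id)
    return rng.normal(size=weights.shape)

# Each worker computes its gradient independently...
grads = [local_gradient(w, weights) for w in range(num_workers)]
# ...then the gradients are averaged (the reduction/distribution step)
# so every replica applies the same update.
avg_grad = np.mean(grads, axis=0)
weights -= 0.01 * avg_grad
```

The averaging step is the reduction/distribution traffic that interconnects like the one proposed here are meant to carry efficiently.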

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms

S Rashidi, S Sridharan, S Srinivasan… - … Analysis of Systems …, 2020 - ieeexplore.ieee.org
Modern Deep Learning systems heavily rely on distributed training over high-performance
accelerator (e.g., TPU, GPU)-based hardware platforms. Examples today include Google's …

DDLBench: towards a scalable benchmarking infrastructure for distributed deep learning

M Jansen, V Codreanu… - 2020 IEEE/ACM Fourth …, 2020 - ieeexplore.ieee.org
Due to its many applications across various fields of research, engineering, and daily life,
deep learning has seen a surge in popularity. Therefore, larger and more expressive models …

Parameter Box: High performance parameter servers for efficient distributed deep neural network training

L Luo, J Nelson, L Ceze, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
Most work in the deep learning systems community has focused on faster inference, but
arriving at a trained model requires lengthy experiments. Accelerating training lets …
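For readers unfamiliar with the parameter-server pattern this entry builds on, a minimal sketch (not Parameter Box itself; the class, names, and sizes are illustrative assumptions): workers pull the current weights, compute gradients locally, and push them back to a server that applies the update.

```python
# Minimal parameter-server push/pull loop (illustrative sketch only).
import numpy as np

class ParameterServer:
    def __init__(self, model_size, lr=0.01):
        self.weights = np.zeros(model_size)
        self.lr = lr

    def pull(self):
        # Workers fetch the latest weights before computing gradients.
        return self.weights.copy()

    def push(self, grad):
        # Apply each worker's gradient as it arrives (asynchronous-style update).
        self.weights -= self.lr * grad

server = ParameterServer(model_size=8)
for worker_id in range(4):                       # assumed 4 workers
    local_weights = server.pull()
    rng = np.random.default_rng(worker_id)
    grad = rng.normal(size=local_weights.shape)  # stand-in for a backward pass
    server.push(grad)
```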

LIBRA: Enabling Workload-Aware Multi-Dimensional Network Topology Optimization for Distributed Training of Large AI Models

W Won, S Rashidi, S Srinivasan… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
As model sizes in machine learning continue to scale, distributed training is necessary to
accommodate model weights within each device and to reduce training time. However, this …

An allreduce algorithm and network co-design for large-scale training of distributed deep learning

TT Nguyen, M Wahib - 2021 IEEE/ACM 21st International …, 2021 - ieeexplore.ieee.org
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …
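The allreduce collective named in the title can be sketched as the standard ring algorithm (a generic illustration, not the co-designed algorithm from this paper): a reduce-scatter phase followed by an all-gather, each taking P-1 steps around the ring.

```python
# Generic ring all-reduce sketch (illustrative only). Each of P workers starts
# with P chunks; after reduce-scatter and all-gather, every worker holds the
# element-wise sum of every chunk.
import numpy as np

def ring_allreduce(chunks):
    """chunks[w][c] is chunk c held by worker w (one NumPy array per chunk)."""
    P = len(chunks)
    # Reduce-scatter: after P-1 steps, worker w holds the fully reduced
    # chunk (w + 1) % P.
    for step in range(P - 1):
        sends = [(w, (w - step) % P, chunks[w][(w - step) % P].copy())
                 for w in range(P)]
        for src, c, payload in sends:
            chunks[(src + 1) % P][c] += payload
    # All-gather: circulate the reduced chunks so every worker has all of them.
    for step in range(P - 1):
        sends = [(w, (w + 1 - step) % P, chunks[w][(w + 1 - step) % P].copy())
                 for w in range(P)]
        for src, c, payload in sends:
            chunks[(src + 1) % P][c] = payload
    return chunks

# Example: 4 workers, each contributing the value (w + 1) in every chunk element.
P, chunk_len = 4, 2
data = [[np.full(chunk_len, float(w + 1)) for _ in range(P)] for w in range(P)]
ring_allreduce(data)
# Every chunk on every worker now equals 1 + 2 + 3 + 4 = 10.
```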

COMET: A comprehensive cluster design methodology for distributed deep learning training

DK Kadiyala, S Rashidi, T Heo, AR Bambhaniya… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of
specialized, high-end nodes to train. Designing such clusters to maximize both performance …

EFLOPS: Algorithm and system co-design for a high performance distributed training platform

J Dong, Z Cao, T Zhang, J Ye, S Wang… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Deep neural networks (DNNs) have attracted tremendous attention as compelling solutions
for applications such as image classification, object detection, speech recognition, and so …