Ring attention with blockwise transformers for near-infinite context

H Liu, M Zaharia, P Abbeel - arXiv preprint arXiv:2310.01889, 2023 - arxiv.org
Transformers have emerged as the architecture of choice for many state-of-the-art AI
models, showcasing exceptional performance across a wide range of AI applications …

Communication-efficient federated learning with compensated Overlap-FedAvg

Y Zhou, Q Ye, J Lv - IEEE Transactions on Parallel and …, 2021 - ieeexplore.ieee.org
While petabytes of data are generated each day by a number of independent computing
devices, only a few of them can ultimately be collected and used for deep learning (DL) due to …

GraphQ: Scalable PIM-based graph processing

Y Zhuo, C Wang, M Zhang, R Wang, D Niu… - Proceedings of the …, 2019 - dl.acm.org
Processing-In-Memory (PIM) architectures based on recent technology advances (e.g.,
Hybrid Memory Cube) demonstrate great potential for graph processing. However, existing …

Overlap communication with dependent computation via decomposition in large deep learning models

S Wang, J Wei, A Sabne, A Davis, B Ilbeyi… - Proceedings of the 28th …, 2022 - dl.acm.org
Large deep learning models have shown great potential with state-of-the-art results in many
tasks. However, running these large models is quite challenging on an accelerator (GPU or …

Optimizing bandwidth limited problems using one-sided communication and overlap

C Bell, D Bonachea, R Nishtala… - Proceedings 20th IEEE …, 2006 - ieeexplore.ieee.org
This paper demonstrates that the one-sided communication used in languages like UPC can
provide a significant performance advantage for bandwidth-limited applications. This is …

Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

JC Sancho, KJ Barker, DJ Kerbyson… - Proceedings of the 2006 …, 2006 - dl.acm.org
The design and implementation of a high performance communication network are critical
factors in determining the performance and cost-effectiveness of a large-scale computing …

Flux: Fast software-based communication overlap on GPUs through kernel fusion

LW Chang, W Bao, Q Hou, C Jiang, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large deep learning models have demonstrated strong ability to solve many tasks across a
wide range of applications. Those large models typically require training and inference to be …

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

G Wang, C Zhang, Z Shen, A Li, O Ruwase - arXiv preprint arXiv …, 2024 - arxiv.org
Given the popularity of generative AI, Large Language Models (LLMs) often consume
hundreds or thousands of GPUs for parallelizing and accelerating the training process …

Communication-sensitive static dataflow for parallel message passing applications

G Bronevetsky - 2009 International Symposium on Code …, 2009 - ieeexplore.ieee.org
Message passing is a very popular style of parallel programming, used in a wide variety of
applications and supported by many APIs, such as BSD sockets, MPI and PVM. Its …

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …