Ring attention with blockwise transformers for near-infinite context

H Liu, M Zaharia, P Abbeel - arXiv preprint arXiv:2310.01889, 2023 - arxiv.org
Transformers have emerged as the architecture of choice for many state-of-the-art AI
models, showcasing exceptional performance across a wide range of AI applications …

Communication-efficient federated learning with compensated Overlap-FedAvg

Y Zhou, Q Ye, J Lv - IEEE Transactions on Parallel and …, 2021 - ieeexplore.ieee.org
While petabytes of data are generated each day by a number of independent computing
devices, only a few of them can ultimately be collected and used for deep learning (DL) due to …

GraphQ: Scalable PIM-based graph processing

Y Zhuo, C Wang, M Zhang, R Wang, D Niu… - Proceedings of the …, 2019 - dl.acm.org
Processing-In-Memory (PIM) architectures based on recent technology advances (e.g.,
Hybrid Memory Cube) demonstrate great potential for graph processing. However, existing …

Overlap communication with dependent computation via decomposition in large deep learning models

S Wang, J Wei, A Sabne, A Davis, B Ilbeyi… - Proceedings of the 28th …, 2022 - dl.acm.org
Large deep learning models have shown great potential with state-of-the-art results in many
tasks. However, running these large models is quite challenging on an accelerator (GPU or …

Optimizing bandwidth limited problems using one-sided communication and overlap

C Bell, D Bonachea, R Nishtala… - Proceedings 20th IEEE …, 2006 - ieeexplore.ieee.org
This paper demonstrates that the one-sided communication used in languages like UPC can
provide a significant performance advantage for bandwidth-limited applications. This is …

Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

JC Sancho, KJ Barker, DJ Kerbyson… - Proceedings of the 2006 …, 2006 - dl.acm.org
The design and implementation of a high performance communication network are critical
factors in determining the performance and cost-effectiveness of a large-scale computing …

Flux: Fast software-based communication overlap on GPUs through kernel fusion

LW Chang, W Bao, Q Hou, C Jiang, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large deep learning models have demonstrated strong ability to solve many tasks across a
wide range of applications. Those large models typically require training and inference to be …

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

G Wang, C Zhang, Z Shen, A Li, O Ruwase - arXiv preprint arXiv …, 2024 - arxiv.org
Given the popularity of generative AI, Large Language Models (LLMs) often consume
hundreds or thousands of GPUs for parallelizing and accelerating the training process …

Communication-sensitive static dataflow for parallel message passing applications

G Bronevetsky - 2009 International Symposium on Code …, 2009 - ieeexplore.ieee.org
Message passing is a very popular style of parallel programming, used in a wide variety of
applications and supported by many APIs, such as BSD sockets, MPI and PVM. Its …

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …