Authors
Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Nie, Tushar Krishna
Publication date
2021/6/14
Conference paper
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
Pages
540-553
Publisher
IEEE
Description
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the …
Total citations
Scholar articles
S Rashidi, M Denton, S Sridharan, S Srinivasan… - 2021 ACM/IEEE 48th Annual International Symposium …, 2021