Authors
Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Nie, Tushar Krishna
Publication date
2021/6/14
Conference paper
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
Pages
540-553
Publisher
IEEE
Description
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the …
Total citations
Scholar articles
S Rashidi, M Denton, S Sridharan, S Srinivasan… - 2021 ACM/IEEE 48th Annual International Symposium …, 2021