GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters

L Oden, H Fröning - 2013 IEEE International Conference on …, 2013 - ieeexplore.ieee.org
2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013 - ieeexplore.ieee.org
Modern GPUs are powerful high-core-count processors, which are no longer used solely for graphics applications, but are also employed to accelerate computationally intensive general-purpose tasks. For utmost performance, GPUs are distributed throughout the cluster to process parallel programs. In fact, many recent high-performance systems in the TOP500 list are heterogeneous architectures. Despite being highly effective processing units, GPUs on different hosts are incapable of communicating without assistance from a CPU. As a result, communication between distributed GPUs suffers from unnecessary overhead, introduced by switching control flow from GPUs to CPUs and vice versa. Most communication libraries even require intermediate copies from GPU memory to host memory. This overhead in particular penalizes small data movements and synchronization operations, reduces efficiency and limits scalability. In this work we introduce global address spaces to facilitate direct communication between distributed GPUs without CPU involvement. Avoiding context switches and unnecessary copying dramatically reduces communication overhead. We evaluate our approach using a variety of workloads including low-level latency and bandwidth benchmarks, basic synchronization primitives like barriers, and a stencil computation as an example application. We see performance benefits of up to 2× for basic benchmarks and up to 1.67× for stencil computations.