Offloading machine learning to programmable data planes: A systematic survey

R Parizotto, BL Coelho, DC Nunes, I Haque… - ACM Computing …, 2023 - dl.acm.org
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …

Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction

RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org
Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs require innovative system architectures to meet the simulation …

High performance interconnect network for Tianhe system

XK Liao, ZB Pang, KF Wang, YT Lu, M Xie, J Xia… - Journal of Computer …, 2015 - Springer
In this paper, we present the Tianhe-2 interconnect network and message passing services.
We describe the architecture of the router and network interface chips, and highlight a set of …

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation

RL Graham, L Levi, D Burredy, G Bloch… - … Conference, ISC High …, 2020 - Springer
This paper describes the new hardware-based streaming-aggregation capability added to
Mellanox's Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand …

A RISC-V in-network accelerator for flexible high-performance low-power packet processing

S Di Girolamo, A Kurth, A Calotoiu… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
The capacity of offloading data and control tasks to the network is becoming increasingly
important, especially if we consider the faster growth of network speed when compared to …

Energy, memory, and runtime tradeoffs for implementing collective communication operations

T Hoefler, D Moor - Supercomputing frontiers and innovations, 2014 - superfri.susu.ru
Collective operations are among the most important communication operations in shared-
and distributed-memory parallel applications. In this paper, we analyze the tradeoffs …

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

K Kandalla, H Subramoni, K Tomko… - … Science-Research and …, 2011 - Springer
Three-dimensional FFT is an important component of many scientific computing applications
ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely …

The TH Express high performance interconnect networks

Z Pang, M Xie, J Zhang, Y Zheng, G Wang… - Frontiers of Computer …, 2014 - Springer
Interconnection network plays an important role in scalable high performance computer
(HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to …

Cheetah: A framework for scalable hierarchical collective operations

R Graham, MG Venkata, J Ladd… - 2011 11th IEEE/ACM …, 2011 - ieeexplore.ieee.org
Collective communication operations, used by many scientific applications, tend to limit
overall parallel application performance and scalability. Computer systems are becoming …

Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities

RL Graham, S Poole, P Shamis, G Bloch… - … on Parallel & …, 2010 - ieeexplore.ieee.org
This paper explores the computation and communication overlap capabilities enabled by
the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface …