Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction

RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org
Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs require innovative system architectures to meet the simulation …

Finepoints: Partitioned multithreaded MPI communication

RE Grant, MGF Dosanjh, MJ Levenhagen… - … Conference, ISC High …, 2019 - Springer
The MPI multithreading model has been historically difficult to optimize; the interface that it
provides for threads was designed as a process-level interface. This model has led to …

A survey of end-system optimizations for high-speed networks

N Hanford, V Ahuja, MK Farrens, B Tierney… - ACM Computing …, 2018 - dl.acm.org
The gap is widening between the processor clock speed of end-system architectures and
network throughput capabilities. It is now physically possible to provide single-flow …

Throttling for bandwidth imbalanced data transfers

T Schneider, KD Underwood, M Flajslik, S Sur… - US Patent …, 2020 - Google Patents
Techniques are disclosed to throttle bandwidth imbalanced data transfers. In some
examples, an example computer-implemented method may include splitting a payload of a …

Network-accelerated non-contiguous memory transfers

S Di Girolamo, K Taranov, A Kurth… - Proceedings of the …, 2019 - dl.acm.org
Applications often communicate data that is non-contiguous in the send- or the
receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non …

INCA: in-network compute assistance

W Schonbein, RE Grant, MGF Dosanjh… - Proceedings of the …, 2019 - dl.acm.org
Current proposals for in-network data processing operate on data as it streams through a
network switch or endpoint. Since compute resources must be available when data arrives …

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …

Exploiting offload enabled network interfaces

S Di Girolamo, P Jolivet… - 2015 IEEE 23rd …, 2015 - ieeexplore.ieee.org
Network interface cards are one of the key components to achieve efficient parallel
performance. In the past, they have gained new functionalities such as lossless …

Security offload using the SmartNIC, A programmable 10 Gbps ethernet NIC

G Sabin, M Rashti - 2015 National Aerospace and Electronics …, 2015 - ieeexplore.ieee.org
The SmartNIC is a User-Programmable 10GE NIC designed around industry standards to
meet the demands of high performance networking in HPC and datacenter communities …

RaDD runtimes: Radical and different distributed runtimes with smartnics

RE Grant, W Schonbein, S Levy - 2020 IEEE/ACM Fourth …, 2020 - ieeexplore.ieee.org
As network speeds increase, the overhead of processing incoming messages is becoming
onerous enough that many manufacturers now provide network interface cards (NICs) with …