Protocols for fully offloaded collective operations on accelerated network adapters

Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction

RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org

Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs, requires innovative system architectures to meet the simulation …

Cited by 168 Related articles All 6 versions

Finepoints: Partitioned multithreaded MPI communication

RE Grant, MGF Dosanjh, MJ Levenhagen… - … Conference, ISC High …, 2019 - Springer

The MPI multithreading model has been historically difficult to optimize; the interface that it
provides for threads was designed as a process-level interface. This model has led to …

Cited by 65 Related articles All 5 versions

A survey of end-system optimizations for high-speed networks

N Hanford, V Ahuja, MK Farrens, B Tierney… - ACM Computing …, 2018 - dl.acm.org

The gap is widening between the processor clock speed of end-system architectures and
network throughput capabilities. It is now physically possible to provide single-flow …

Cited by 20 Related articles

Throttling for bandwidth imbalanced data transfers

T Schneider, KD Underwood, M Flajslik, S Sur… - US Patent …, 2020 - Google Patents

Techniques are disclosed to throttle bandwidth imbalanced data transfers. In some
examples, an example computer-implemented method may include splitting a payload of a …

Cited by 47 Related articles All 4 versions

Network-accelerated non-contiguous memory transfers

S Di Girolamo, K Taranov, A Kurth… - Proceedings of the …, 2019 - dl.acm.org

Applications often communicate data that is non-contiguous in the send- or the
receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non …

Cited by 34 Related articles All 29 versions

INCA: in-network compute assistance

W Schonbein, RE Grant, MGF Dosanjh… - Proceedings of the …, 2019 - dl.acm.org

Current proposals for in-network data processing operate on data as it streams through a
network switch or endpoint. Since compute resources must be available when data arrives …

Cited by 30 Related articles All 3 versions

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …

Cited by 1 Related articles All 11 versions

Exploiting offload enabled network interfaces

S Di Girolamo, P Jolivet… - 2015 IEEE 23rd …, 2015 - ieeexplore.ieee.org

Network interface cards are one of the key components to achieve efficient parallel
performance. In the past, they have gained new functionalities such as lossless …

Cited by 32 Related articles All 37 versions

Security offload using the SmartNIC, A programmable 10 Gbps ethernet NIC

G Sabin, M Rashti - 2015 National Aerospace and Electronics …, 2015 - ieeexplore.ieee.org

The SmartNIC is a User-Programmable 10GE NIC designed around industry standards to
meet the demands of high performance networking in HPC and datacenter communities …

Cited by 21 Related articles All 2 versions

RaDD runtimes: Radical and different distributed runtimes with smartnics

RE Grant, W Schonbein, S Levy - 2020 IEEE/ACM Fourth …, 2020 - ieeexplore.ieee.org

As network speeds increase, the overhead of processing incoming messages is becoming
onerous enough that many manufacturers now provide network interface cards (NICs) with …

Cited by 9 Related articles All 5 versions
