Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct...

AA Awan, K Hamidouche, JM Hashmi… - Proceedings of the 22nd …, 2017 - dl.acm.org

Availability of large data sets like ImageNet and massively parallel computation support in
modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning …

Cited by 180 · Related articles · All 5 versions

[PDF] gatech.edu

Tofu interconnect 2: System-on-chip integration of high-performance interconnect

Y Ajima, T Inoue, S Hiramoto, S Uno… - International …, 2014 - Springer

The Tofu Interconnect 2 (Tofu2) is a system interconnect designed for Fujitsu's
next-generation successor to the PRIMEHPC FX10 supercomputer. Tofu2 inherited the 6 …

Cited by 117 · Related articles · All 15 versions

[PDF] psu.edu

Using run-time reconfiguration for fault injection in hardware prototypes

L Antoni, R Leveugle, M Feher - 17th IEEE International …, 2002 - ieeexplore.ieee.org

In this paper, a new methodology for the injection of single event upsets (SEU) in memory
elements is introduced. SEUs in memory elements can occur due to many reasons (e.g. …

Cited by 181 · Related articles · All 16 versions

[PDF] ict.ac.cn

High performance interconnect network for Tianhe system

XK Liao, ZB Pang, KF Wang, YT Lu, M Xie, J Xia… - Journal of Computer …, 2015 - Springer

In this paper, we present the Tianhe-2 interconnect network and message passing services.
We describe the architecture of the router and network interface chips, and highlight a set of …

Cited by 90 · Related articles · All 8 versions

[PDF] nsf.gov

FLASH: FPGA-accelerated smart switches with GCN case study

P Haghi, W Krska, C Tan, T Geng, PH Chen… - Proceedings of the 37th …, 2023 - dl.acm.org

Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene
clusters, are augmented to process packets at the application level with fixed-function …

Cited by 10 · Related articles · All 3 versions

[PDF] iczhiku.com

Hierarchical test access architecture for embedded cores in an integrated circuit

D Bhattacharya - … . 16th IEEE VLSI Test Symposium (Cat. No …, 1998 - ieeexplore.ieee.org

The rapid emergence of reusable core-based designs, in the last few years, poses new
challenges to the IEEE test access standard 1149.1. Due to widespread industrial …

Cited by 125 · Related articles · All 4 versions

[PDF] osti.gov

INCA: in-network compute assistance

W Schonbein, RE Grant, MGF Dosanjh… - Proceedings of the …, 2019 - dl.acm.org

Current proposals for in-network data processing operate on data as it streams through a
network switch or endpoint. Since compute resources must be available when data arrives …

Cited by 29 · Related articles · All 3 versions

[PDF] researchgate.net

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

K Kandalla, H Subramoni, K Tomko… - … Science-Research and …, 2011 - Springer

Three-dimensional FFT is an important component of many scientific computing applications
ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely …

Cited by 77 · Related articles · All 7 versions

[PDF] arxiv.org

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …

Cited by 1 · Related articles · All 11 versions

[PDF] researchgate.net

The TH Express high performance interconnect networks

Z Pang, M Xie, J Zhang, Y Zheng, G Wang… - Frontiers of Computer …, 2014 - Springer

Interconnection network plays an important role in scalable high performance computer
(HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to …

Cited by 59 · Related articles · All 7 versions
