S-Caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters

AA Awan, K Hamidouche, JM Hashmi… - Proceedings of the 22nd …, 2017 - dl.acm.org
Availability of large data sets like ImageNet and massively parallel computation support in
modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning …

Tofu interconnect 2: System-on-chip integration of high-performance interconnect

Y Ajima, T Inoue, S Hiramoto, S Uno… - International …, 2014 - Springer
The Tofu Interconnect 2 (Tofu2) is a system interconnect designed for Fujitsu's
next-generation successor to the PRIMEHPC FX10 supercomputer. Tofu2 inherited the 6 …

Using run-time reconfiguration for fault injection in hardware prototypes

L Antoni, R Leveugle, M Feher - 17th IEEE International …, 2002 - ieeexplore.ieee.org
In this paper, a new methodology for the injection of single event upsets (SEU) in memory
elements is introduced. SEUs in memory elements can occur due to many reasons (e.g. …

High performance interconnect network for Tianhe system

XK Liao, ZB Pang, KF Wang, YT Lu, M Xie, J Xia… - Journal of Computer …, 2015 - Springer
In this paper, we present the Tianhe-2 interconnect network and message passing services.
We describe the architecture of the router and network interface chips, and highlight a set of …

FLASH: FPGA-accelerated smart switches with GCN case study

P Haghi, W Krska, C Tan, T Geng, PH Chen… - Proceedings of the 37th …, 2023 - dl.acm.org
Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene
clusters, are augmented to process packets at the application level with fixed-function …

Hierarchical test access architecture for embedded cores in an integrated circuit

D Bhattacharya - … . 16th IEEE VLSI Test Symposium (Cat. No …, 1998 - ieeexplore.ieee.org
The rapid emergence of reusable core-based designs, in the last few years, poses new
challenges to the IEEE test access standard 1149.1. Due to widespread industrial …

INCA: in-network compute assistance

W Schonbein, RE Grant, MGF Dosanjh… - Proceedings of the …, 2019 - dl.acm.org
Current proposals for in-network data processing operate on data as it streams through a
network switch or endpoint. Since compute resources must be available when data arrives …

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

K Kandalla, H Subramoni, K Tomko… - … Science-Research and …, 2011 - Springer
Three-dimensional FFT is an important component of many scientific computing applications
ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely …

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

M Khalilov, S Di Girolamo, M Chrapek… - … Conference for High …, 2024 - ieeexplore.ieee.org
In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be
interleaved to maximize the communication/computation overlap. In this scenario …

The TH Express high performance interconnect networks

Z Pang, M Xie, J Zhang, Y Zheng, G Wang… - Frontiers of Computer …, 2014 - Springer
The interconnection network plays an important role in scalable high performance computer
(HPC) systems. The TH Express-2 interconnect has been used in the MilkyWay-2 system to …