Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction

RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org
Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs require innovative system architectures to meet the simulation …

An evaluation of the CORAL interconnects

C Zimmer, S Atchley, R Pankajakshan… - Proceedings of the …, 2019 - dl.acm.org
The US Department of Energy deployed the Summit and Sierra supercomputers with the
latest state-of-the-art network interconnect technology in 2018 and both systems entered …

Hierarchical redesign of classic MPI reduction algorithms

K Hasanov, A Lastovetsky - The Journal of Supercomputing, 2017 - Springer
Optimization of MPI collective communication operations has been an active research topic
since the advent of MPI in the 1990s. Many general and architecture-specific collective …

Efficient process arrival pattern aware collective communication for deep learning

P Alizadeh, A Sojoodi, Y Hassan Temucin… - Proceedings of the 29th …, 2022 - dl.acm.org
MPI collective communication operations are used extensively in parallel applications. As
such, researchers have been investigating how to improve their performance and scalability …

Tascade: Hardware support for atomic-free, asynchronous and efficient reduction trees

M Orenes-Vera, E Tureci, D Wentzlaff… - arXiv preprint arXiv …, 2023 - arxiv.org
As system parallelism at chip- and server-level increases, challenges that arose with
network-level systems a decade ago, are now being encountered with these massively parallel …

Energy-efficient collective reduce and allreduce operations on distributed GPUs

L Oden, B Klenk, H Fröning - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org
GPUs gain high popularity in High Performance Computing, due to their massive parallelism
and high performance per Watt. Despite their popularity, data transfer between multiple …

Designing a Parallel Programs on the Base of the Conception of Q-Determinant

V Aleeva - … : 4th Russian Supercomputing Days, RuSCDays 2018 …, 2019 - Springer
The paper describes a design method of parallel programs for numerical algorithms based
on their representation in the form of Q-determinant. The result of the method is Q-effective …

High-Performance Computing Using Application of Q-determinant of Numerical Algorithms

VN Aleeva, RZ Aleev - 2018 Global Smart Industry Conference …, 2018 - ieeexplore.ieee.org
The conception of Q-determinant is one of the approaches to parallelizing numerical
algorithms. The basic notion of the conception is Q-determinant of the algorithm. Here Q is …

Hierarchical optimization of MPI reduce algorithms

K Hasanov, A Lastovetsky - … , PaCT 2015, Petrozavodsk, Russia, August 31 …, 2015 - Springer
Optimization of MPI collective communication operations has been an active research topic
since the advent of MPI in the 1990s. Many general and architecture-specific collective …

Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives

MG Venkata, V Petrov, S Lebedev… - … IEEE Symposium on …, 2024 - ieeexplore.ieee.org
Unified Collective Communication (UCC) is an API and library implementation of collective
communication operations. The goal of UCC is to provide a unified API and library serving …