Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical...

Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction

RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org

Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs, requires innovative system architectures to meet the simulation …

Cited by 167 · All 6 versions

An evaluation of the CORAL interconnects

C Zimmer, S Atchley, R Pankajakshan… - Proceedings of the …, 2019 - dl.acm.org

The US Department of Energy deployed the Summit and Sierra supercomputers with the
latest state-of-the-art network interconnect technology in 2018 and both systems entered …

Cited by 30 · All 2 versions

Hierarchical redesign of classic MPI reduction algorithms

K Hasanov, A Lastovetsky - The Journal of Supercomputing, 2017 - Springer

Optimization of MPI collective communication operations has been an active research topic
since the advent of MPI in 1990s. Many general and architecture-specific collective …

Cited by 38 · All 9 versions

Efficient process arrival pattern aware collective communication for deep learning

P Alizadeh, A Sojoodi, Y Hassan Temucin… - Proceedings of the 29th …, 2022 - dl.acm.org

MPI collective communication operations are used extensively in parallel applications. As
such, researchers have been investigating how to improve their performance and scalability …

Cited by 9 · All 4 versions

Tascade: Hardware support for atomic-free, asynchronous and efficient reduction trees

M Orenes-Vera, E Tureci, D Wentzlaff… - arXiv preprint arXiv …, 2023 - arxiv.org

As system parallelism at chip-and server-level increases, challenges that arose with network-
level systems a decade ago, are now being encountered with these massively parallel …

Cited by 2 · All 2 versions

Energy-efficient collective reduce and allreduce operations on distributed GPUs

L Oden, B Klenk, H Fröning - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org

GPUs gain high popularity in High Performance Computing, due to their massive parallelism
and high performance per Watt. Despite their popularity, data transfer between multiple …

Cited by 30 · All 7 versions

Designing a Parallel Programs on the Base of the Conception of Q-Determinant

V Aleeva - … : 4th Russian Supercomputing Days, RuSCDays 2018 …, 2019 - Springer

The paper describes a design method of parallel programs for numerical algorithms based
on their representation in the form of Q-determinant. The result of the method is Q-effective …

Cited by 13 · All 5 versions

High-Performance Computing Using Application of Q-determinant of Numerical Algorithms

VN Aleeva, RZ Aleev - 2018 Global Smart Industry Conference …, 2018 - ieeexplore.ieee.org

The conception of Q-determinant is one of the approaches to parallelizing numerical
algorithms. The basic notion of the conception is Q-determinant of the algorithm. Here Q is …

Cited by 10 · All 2 versions

Hierarchical optimization of MPI reduce algorithms

K Hasanov, A Lastovetsky - … , PaCT 2015, Petrozavodsk, Russia, August 31 …, 2015 - Springer

Optimization of MPI collective communication operations has been an active research topic
since the advent of MPI in 1990s. Many general and architecture-specific collective …

Cited by 11 · All 4 versions

Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives

MG Venkata, V Petrov, S Lebedev… - … IEEE Symposium on …, 2024 - ieeexplore.ieee.org

Unified Collective Communication (UCC) is an API and library implementation of collective
communication operations. The goal of UCC is to provide a unified API and library serving …
