High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand...

D Pekurovsky - SIAM Journal on Scientific Computing, 2012 - SIAM

Fourier and related transforms are a family of algorithms widely employed in diverse areas
of computational science, notoriously difficult to scale on high-performance parallel …

Cited by 325 Related articles All 8 versions

Bluesmpi: Efficient mpi non-blocking alltoall offloading designs on modern bluefield smart nics

M Bayatpour, N Sarkauskas, H Subramoni… - … Conference on High …, 2021 - Springer

In the state-of-the-art production quality MPI (Message Passing Interface) libraries,
communication progress is either performed by the main thread or a separate …

Cited by 51 Related articles All 3 versions

[PDF] psu.edu

Using run-time reconfiguration for fault injection in hardware prototypes

L Antoni, R Leveugle, M Feher - 17th IEEE International …, 2002 - ieeexplore.ieee.org

In this paper, a new methodology for the injection of single event upsets (SEU) in memory
elements is introduced. SEUs in memory elements can occur due to many reasons (eg …

Cited by 181 Related articles All 16 versions

[PDF] arxiv.org

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

A Gholami, J Hill, D Malhotra, G Biros - arXiv preprint arXiv:1506.07933, 2015 - arxiv.org

We present a new library for parallel distributed Fast Fourier Transforms (FFT). The
importance of FFT in science and engineering and the advances in high performance …

Cited by 73 Related articles All 5 versions

[PDF] sciencedirect.com

Efficient design for MPI asynchronous progress without dedicated resources

A Ruhela, H Subramoni, S Chakraborty, M Bayatpour… - Parallel Computing, 2019 - Elsevier

The overlap of computation and communication is critical for good performance of many
HPC applications. State-of-the-art designs for the asynchronous progress require specially …

Cited by 14 Related articles All 3 versions

[PDF] google.com

Scalable reduction collectives with data partitioning-based multi-leader design

M Bayatpour, S Chakraborty, H Subramoni… - Proceedings of the …, 2017 - dl.acm.org

Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in
modern multi-/many-core processors like Intel Xeon/Xeon Phis or the increases in …

Cited by 49 Related articles All 4 versions

[PDF] amazonaws.com

[PDF] The MVAPICH project: Evolution and sustainability of an open source production quality MPI library for HPC

DK Panda, K Tomko, K Schulz… - … with Int'l …, 2013 - pfigshare-u-files.s3.amazonaws.com

I. OVERVIEW OF THE MVAPICH PROJECT The MVAPICH (for MPI-1) and MVAPICH2 (for
MPI-2 and MPI-3) open-source libraries [?] have been designed and developed during the …

Cited by 70 Related articles All 3 versions

[PDF] ohio-state.edu

Efficient asynchronous communication progress for MPI without dedicated resources

A Ruhela, H Subramoni, S Chakraborty… - Proceedings of the 25th …, 2018 - dl.acm.org

The overlap of computation and communication is critical for good performance of many
HPC applications. State-of-the-art designs for the asynchronous progress require specially …

Cited by 27 Related articles All 12 versions

Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

W Wang, Z Lai, S Li, W Liu, K Ge, Y Liu… - 2023 IEEE …, 2023 - ieeexplore.ieee.org

Mixture of Expert (MoE) has received increasing attention for scaling DNN models to extra-
large size with negligible increases in computation. The MoE model has achieved the …

Cited by 6 Related articles All 2 versions

[PDF] utk.edu

[PDF] Interim report on benchmarking FFT libraries on high performance systems

A Ayala, S Tomov, P Luszczek, S Cayrols… - … of Tennessee, ICL …, 2021 - icl.utk.edu

Abstract The Fast Fourier Transform (FFT) is used in many applications such as molecular
dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and …

Cited by 15 Related articles All 2 versions
