Evaluating scalability bottlenecks by workload extrapolation

R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org
Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

R Shi, S Potluri, K Hamidouche… - … Conference on High …, 2014 - ieeexplore.ieee.org
An increasing number of MPI applications are being ported to take advantage of the compute
power offered by GPUs. Data movement on GPU clusters continues to be the major …

Software Aging and Multifractality of Memory Resources.

M Shereshevsky, J Crowell, B Cukic, V Gandikota… - DSN, 2003 - scholar.archive.org
We investigate the dynamics of monitored memory resource utilizations in an operating
system under stress using quantitative methods of fractal analysis. In the experiments, we …

Automatic irregularity-aware fine-grained workload partitioning on integrated architectures

F Zhang, J Zhai, B Wu, B He, W Chen… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
The integrated architecture that features both CPU and GPU on the same die is an emerging
and promising architecture for fine-grained CPU-GPU collaboration. However, the …

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures

JM Hashmi, CH Chu, S Chakraborty… - Journal of Parallel and …, 2020 - Elsevier
This paper addresses the challenges of MPI derived datatype processing and proposes
FALCON-X—A Fast and Low-overhead Communication framework for optimized zero-copy …

High performance MPI datatype support with user-mode memory registration: Challenges, designs, and benefits

M Li, H Subramoni, K Hamidouche… - … on Cluster Computing, 2015 - ieeexplore.ieee.org
Noncontiguous data communication has been heavily adopted in scientific applications,
especially for those written with MPI. Common strategies to handle noncontiguous data, like …

Dynamic kernel fusion for bulk non-contiguous data transfer on GPU clusters

CH Chu, KS Khorassani, Q Zhou… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
In the last decade, many scientific applications have been significantly accelerated by large-
scale GPU systems. However, the movement of non-contiguous GPU-resident data is one of …

Distributed join algorithms on multi-CPU clusters with GPUDirect RDMA

C Guo, H Chen, F Zhang, C Li - … of the 48th International Conference on …, 2019 - dl.acm.org
In data management systems, query processing on GPUs or distributed clusters has
proven to be an effective method for high efficiency. However, the high PCIe data transfer …

High-performance adaptive MPI derived datatype communication for modern Multi-GPU systems

CH Chu, JM Hashmi, KS Khorassani… - 2019 IEEE 26th …, 2019 - ieeexplore.ieee.org
The recent advent of the NVLink interconnect and Peripheral Component Interconnect
express (PCIe) switch has resulted in the creation of extremely dense Graphics Processing …

Network assisted non-contiguous transfers for GPU-aware MPI libraries

KK Suresh, KS Khorassani, CC Chen… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
The importance of GPUs in accelerating HPC applications is evident by the fact that a large
number of super-computing clusters are GPU-enabled. Many of these HPC applications use …