Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters

Q Zhou, C Chu, NS Kumar, P Kousha… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
While the memory bandwidth of accelerators such as GPUs has significantly improved over
the last decade, commodity networks such as Ethernet and InfiniBand are lagging in …

Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication

Q Zhou, Q Anthony, L Xu, A Shafi… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out
data-parallel training of Deep Learning (DL) models. It shards the model parameters …
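
As a rough illustration of the parameter sharding the snippet refers to, here is a minimal plain-PyTorch FSDP sketch; it does not include the compression-assisted allgather/reduce-scatter collectives this paper proposes, and the toy model, sizes, and launch assumptions are hypothetical:

```python
# Minimal FSDP sketch (plain PyTorch, not the paper's compression-assisted
# collectives). Assumes the process group env vars were set by torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()   # hypothetical toy model
model = FSDP(model)                          # parameters sharded across ranks

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()                              # gradients reduce-scattered across ranks
optim.step()
dist.destroy_process_group()
```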

ZHW: A numerical codec for big data scientific computation

M Barrow, Z Wu, S Lloyd, M Gokhale… - … Conference on Field …, 2022 - ieeexplore.ieee.org
Distributed big data in scientific computing presents a major I/O performance bottleneck
when exploiting data parallelism. Consumer and producer compute nodes are often …

DE-ZFP: An FPGA implementation of a modified ZFP compression/decompression algorithm

M Habboush, AH El-Maleh, MES Elrabaa… - Microprocessors and …, 2022 - Elsevier
In this work, we present DE-ZFP: a hardware implementation of modified ZFP compression
and decompression algorithms on a Field Programmable Gate Array (FPGA). It can be used …

MobileNets can be lossily compressed: Neural network compression for embedded accelerators

SM Lim, SW Jun - Electronics, 2022 - mdpi.com
Although neural network quantization is an imperative technology for the computation and
memory efficiency of embedded neural network accelerators, simple post-training …
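
For context on the baseline the snippet alludes to, a minimal sketch of simple post-training (dynamic) quantization with stock PyTorch follows; it is not this paper's lossy-compression scheme, and the toy classifier-head model is hypothetical:

```python
import torch

# Hypothetical small float model standing in for a MobileNet-style classifier head.
model = torch.nn.Sequential(torch.nn.Linear(1280, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: Linear weights become int8,
# activations are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear},
                                                dtype=torch.qint8)

x = torch.randn(1, 1280)
with torch.no_grad():
    print(qmodel(x).shape)   # same outputs, int8 Linear weights
```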

ZFP: A compressed array representation for numerical computations

P Lindstrom, J Hittinger, J Diffenderfer… - … Journal of High …, 2025 - journals.sagepub.com
HPC trends favor algorithms and implementations that reduce data motion relative to
FLOPS. We investigate the use of lossy compressed data arrays in place of traditional IEEE …
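
As a hedged illustration of trading accuracy for size with ZFP, the sketch below does a fixed-accuracy round trip through the zfpy Python bindings; the paper's in-memory compressed-array classes belong to the C++ zfp library and are not reproduced here, and the array shape and tolerance are arbitrary:

```python
# Fixed-accuracy ZFP round trip via the zfpy bindings (whole-array
# compression only, not the C++ compressed-array classes).
import numpy as np
import zfpy

a = np.random.rand(64, 64, 64)                    # example 3D field
buf = zfpy.compress_numpy(a, tolerance=1e-6)      # fixed-accuracy mode
b = zfpy.decompress_numpy(buf)

print("compression ratio:", a.nbytes / len(buf))
print("max abs error:", np.max(np.abs(a - b)))    # bounded by the tolerance
```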

Extending the problem data size for GPU simulation beyond the GPU memory storage with LRnLA algorithms

A Perepelkina, V Levchenko… - Journal of Physics …, 2021 - iopscience.iop.org
Using CPU RAM to store the data of a GPU simulation requires data exchange between the
CPU and GPU. With LRnLA algorithms, the computation without data exchange …
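
As a generic sketch of the CPU-GPU data exchange the snippet mentions (not the LRnLA traversal itself), the following stages chunks of a host-resident problem through GPU memory with CuPy; the chunk sizes and the stand-in kernel are hypothetical:

```python
# Chunked host<->device staging with CuPy for a problem larger than GPU memory.
import numpy as np
import cupy as cp

N_CHUNKS, CHUNK = 16, (256, 256, 256)          # hypothetical sizes
host_data = [np.random.rand(*CHUNK).astype(np.float32) for _ in range(N_CHUNKS)]

stream = cp.cuda.Stream(non_blocking=True)
for i, chunk in enumerate(host_data):
    with stream:
        d = cp.asarray(chunk)                  # host -> device copy
        d = cp.sqrt(d) + 1.0                   # stand-in for the simulation kernel
        host_data[i] = cp.asnumpy(d)           # device -> host copy
stream.synchronize()
```

Overlapping the transfers with compute in earnest would additionally require pinned host buffers; the sketch only shows the staging pattern.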

Accelerating broadcast communication with GPU compression for deep learning workloads

Q Zhou, Q Anthony, A Shafi… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org
With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on
multiple GPU nodes to run distributed training. Large-message communication of GPU data …

CEAZ: accelerating parallel I/O via hardware-algorithm co-designed adaptive lossy compression

C Zhang, S Jin, T Geng, J Tian, A Li, D Tao - Proceedings of the 36th …, 2022 - dl.acm.org
As HPC systems continue to grow to exascale, the amount of data that needs to be saved or
transmitted is exploding. To this end, many previous works have studied using error …

PAS: A new powerful and simple quantum computing simulator

H Bian, J Huang, J Tang, R Dong… - Software: Practice and …, 2023 - Wiley Online Library
In recent years, many researchers have used CPUs for quantum computing simulation.
However, in practice, the simulation efficiency of a large-scale simulator is low on a single …