Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters

Q Zhou, C Chu, NS Kumar, P Kousha… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
While the memory bandwidth of accelerators such as GPUs has significantly improved over
the last decade, commodity networks such as Ethernet and InfiniBand are lagging in …

Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication

Q Zhou, Q Anthony, L Xu, A Shafi… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out
data-parallel training of Deep Learning (DL) models. It shards the model parameters …
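
As a rough illustration of the parameter sharding the snippet refers to, here is a minimal plain-PyTorch FSDP sketch; it does not include the compression-assisted allgather/reduce-scatter collectives this paper proposes, and the toy model, sizes, and launch assumptions are hypothetical:

```python
# Minimal FSDP sketch (plain PyTorch, not the paper's compression-assisted
# collectives). Assumes the process group env vars were set by torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()   # hypothetical toy model
model = FSDP(model)                          # parameters sharded across ranks

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()                              # gradients reduce-scattered across ranks
optim.step()
dist.destroy_process_group()
```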

ZHW: A numerical codec for big data scientific computation

M Barrow, Z Wu, S Lloyd, M Gokhale… - … Conference on Field …, 2022 - ieeexplore.ieee.org
Distributed big data in scientific computing presents a major I/O performance bottleneck
when exploiting data parallelism. Consumer and producer compute nodes are often …

DE-ZFP: An FPGA implementation of a modified ZFP compression/decompression algorithm

M Habboush, AH El-Maleh, MES Elrabaa… - Microprocessors and …, 2022 - Elsevier
In this work, we present DE-ZFP: a hardware implementation of modified ZFP compression
and decompression algorithms on a Field Programmable Gate Array (FPGA). It can be used …

MobileNets can be lossily compressed: Neural network compression for embedded accelerators

SM Lim, SW Jun - Electronics, 2022 - mdpi.com
Although neural network quantization is an imperative technology for the computation and
memory efficiency of embedded neural network accelerators, simple post-training …
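
For context on the baseline the snippet alludes to, a minimal sketch of simple post-training (dynamic) quantization with stock PyTorch follows; it is not this paper's lossy-compression scheme, and the toy classifier-head model is hypothetical:

```python
import torch

# Hypothetical small float model standing in for a MobileNet-style classifier head.
model = torch.nn.Sequential(torch.nn.Linear(1280, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: Linear weights become int8,
# activations are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear},
                                                dtype=torch.qint8)

x = torch.randn(1, 1280)
with torch.no_grad():
    print(qmodel(x).shape)   # same outputs, int8 Linear weights
```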

ZFP: A compressed array representation for numerical computations

P Lindstrom, J Hittinger, J Diffenderfer… - … Journal of High …, 2025 - journals.sagepub.com
HPC trends favor algorithms and implementations that reduce data motion relative to
FLOPS. We investigate the use of lossy compressed data arrays in place of traditional IEEE …
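
As a hedged illustration of trading accuracy for size with ZFP, the sketch below does a fixed-accuracy round trip through the zfpy Python bindings; the paper's in-memory compressed-array classes belong to the C++ zfp library and are not reproduced here, and the array shape and tolerance are arbitrary:

```python
# Fixed-accuracy ZFP round trip via the zfpy bindings (whole-array
# compression only, not the C++ compressed-array classes).
import numpy as np
import zfpy

a = np.random.rand(64, 64, 64)                    # example 3D field
buf = zfpy.compress_numpy(a, tolerance=1e-6)      # fixed-accuracy mode
b = zfpy.decompress_numpy(buf)

print("compression ratio:", a.nbytes / len(buf))
print("max abs error:", np.max(np.abs(a - b)))    # bounded by the tolerance
```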

Extending the problem data size for GPU simulation beyond the GPU memory storage with LRnLA algorithms

A Perepelkina, V Levchenko… - Journal of Physics …, 2021 - iopscience.iop.org
Using CPU RAM to store the data of a GPU simulation requires data exchange between the
CPU and GPU. With LRnLA algorithms, the computation without data exchange …
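
As a generic sketch of the CPU-GPU data exchange the snippet mentions (not the LRnLA traversal itself), the following stages chunks of a host-resident problem through GPU memory with CuPy; the chunk sizes and the stand-in kernel are hypothetical:

```python
# Chunked host<->device staging with CuPy for a problem larger than GPU memory.
import numpy as np
import cupy as cp

N_CHUNKS, CHUNK = 16, (256, 256, 256)          # hypothetical sizes
host_data = [np.random.rand(*CHUNK).astype(np.float32) for _ in range(N_CHUNKS)]

stream = cp.cuda.Stream(non_blocking=True)
for i, chunk in enumerate(host_data):
    with stream:
        d = cp.asarray(chunk)                  # host -> device copy
        d = cp.sqrt(d) + 1.0                   # stand-in for the simulation kernel
        host_data[i] = cp.asnumpy(d)           # device -> host copy
stream.synchronize()
```

Overlapping the transfers with compute in earnest would additionally require pinned host buffers; the sketch only shows the staging pattern.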

Accelerating broadcast communication with GPU compression for deep learning workloads

Q Zhou, Q Anthony, A Shafi… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org
With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on
multiple GPU nodes to run distributed training. Large-message communication of GPU data …

CEAZ: accelerating parallel I/O via hardware-algorithm co-designed adaptive lossy compression

C Zhang, S Jin, T Geng, J Tian, A Li, D Tao - Proceedings of the 36th …, 2022 - dl.acm.org
As HPC systems continue to grow to exascale, the amount of data that needs to be saved or
transmitted is exploding. To this end, many previous works have studied using error …

PAS: A new powerful and simple quantum computing simulator

H Bian, J Huang, J Tang, R Dong… - Software: Practice and …, 2023 - Wiley Online Library
In recent years, many researchers have used CPUs for quantum computing simulation.
However, in practice, the simulation efficiency of a large-scale simulator is low on a single …