Microscaling data formats for deep learning

BD Rouhani, R Zhao, A More, M Hall… - arXiv preprint arXiv …, 2023 - arxiv.org
Narrow bit-width data formats are key to reducing the computational and storage costs of
modern deep learning applications. This paper evaluates Microscaling (MX) data formats …
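
The entry above names block-scaled narrow formats but the snippet stops before explaining the mechanism. As a rough illustration only, the NumPy sketch below quantizes each block of 32 values against one shared power-of-two scale; the block size, element bit width, and scale encoding here are illustrative assumptions, not the published MX specification.

```python
# Minimal sketch of block-scaled quantization in the spirit of MX-style formats:
# each block of 32 values shares one power-of-two scale, and elements are stored
# as low-bit integers. Parameters are illustrative, not the MX spec.
import numpy as np

def quantize_blockwise(x, block=32, elem_bits=8):
    """Quantize a 1-D array with one shared power-of-two scale per block."""
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)        # [num_blocks, block]
    qmax = 2 ** (elem_bits - 1) - 1                    # e.g. 127 for 8-bit
    amax = np.abs(xb).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)
    # Shared scale: smallest power of two such that amax / scale <= qmax.
    exp = np.ceil(np.log2(amax / qmax))
    scale = 2.0 ** exp                                  # one scale per block
    q = np.clip(np.rint(xb / scale), -qmax, qmax)       # low-bit elements
    return q.astype(np.int32), scale

def dequantize_blockwise(q, scale, orig_len):
    return (q * scale).reshape(-1)[:orig_len]

if __name__ == "__main__":
    x = np.random.randn(1000).astype(np.float32)
    q, s = quantize_blockwise(x)
    err = np.abs(dequantize_blockwise(q, s, len(x)) - x).max()
    print("max abs reconstruction error:", err)
```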

RPTQ: Reorder-based post-training quantization for large language models

Z Yuan, L Niu, J Liu, W Liu, X Wang, Y Shang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale language models (LLMs) have demonstrated impressive performance, but their
deployment presents challenges due to their significant memory usage. This issue can be …

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

C Sakr, S Dai, R Venkatesan… - International …, 2022 - proceedings.mlr.press
Data clipping is crucial in reducing noise in quantization operations and improving the
achievable accuracy of quantization-aware training (QAT). Current practices rely on …
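
The snippet above points at the core trade-off: a clip value that is too small truncates large values, while one that is too large wastes quantizer resolution on rare outliers. The paper derives the optimum analytically; the brute-force MSE sweep below is only an assumption-based stand-in that illustrates why a tuned clip beats clipping at the maximum.

```python
# Illustration of the clipping/rounding trade-off in uniform quantization.
# The grid search over clip values is a simple stand-in for the paper's
# analytical solution, used here only to show the effect.
import numpy as np

def quantize(x, clip, bits=4):
    """Symmetric uniform quantization of x into [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    step = clip / levels
    q = np.clip(np.rint(x / step), -levels, levels)
    return q * step

def best_clip_by_mse(x, bits=4, num_candidates=200):
    """Pick the clip value minimizing quantization MSE over a grid."""
    top = np.abs(x).max()
    candidates = np.linspace(top / num_candidates, top, num_candidates)
    mses = [np.mean((quantize(x, c, bits) - x) ** 2) for c in candidates]
    return candidates[int(np.argmin(mses))]

if __name__ == "__main__":
    x = np.random.randn(100_000).astype(np.float32)
    naive = np.abs(x).max()                      # clip at the extreme value
    tuned = best_clip_by_mse(x, bits=4)
    for name, c in [("max-abs clip", naive), ("MSE-optimal clip", tuned)]:
        mse = np.mean((quantize(x, c, 4) - x) ** 2)
        print(f"{name}: clip={c:.3f}  mse={mse:.5f}")
```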

With shared microexponents, a little shifting goes a long way

B Darvish Rouhani, R Zhao, V Elango… - Proceedings of the 50th …, 2023 - dl.acm.org
This paper introduces Block Data Representations (BDR), a framework for exploring and
evaluating a wide spectrum of narrow-precision formats for deep learning. It enables …

Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control

CY Yen, S Abbasloo, HJ Chao - … of the ACM SIGCOMM 2023 Conference, 2023 - dl.acm.org
In this work, for the first time, we demonstrate that computers can automatically learn from
observing the heuristic efforts of the last four decades, stand on the shoulders of the existing …

A 95.6-TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm

B Keller, R Venkatesan, S Dai, SG Tell… - IEEE Journal of Solid …, 2023 - ieeexplore.ieee.org
The energy efficiency of deep neural network (DNN) inference can be improved with custom
accelerators. DNN inference accelerators often employ specialized hardware techniques to …
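
The title above names per-vector scaled 4-bit quantization as the key technique. As a hedged sketch only, the code below quantizes weights to 4-bit integers with one scale per small vector and then quantizes those per-vector scales against a single coarse floating-point scale; the vector length and scale bit width are illustrative assumptions, not the exact parameters of the accelerator in the paper.

```python
# Sketch of two-level per-vector scaled quantization: 4-bit weights, integer
# per-vector scales, one coarse floating-point scale. Parameters are assumed
# for illustration, not taken from the paper.
import numpy as np

def per_vector_quantize(w, vec=64, wbits=4, sbits=8):
    w = w.reshape(-1, vec)                          # [num_vectors, vec]
    qmax = 2 ** (wbits - 1) - 1                     # 7 for 4-bit weights
    smax = 2 ** sbits - 1                           # 255 for 8-bit scales
    fine = np.abs(w).max(axis=1, keepdims=True) / qmax   # per-vector scale
    fine = np.where(fine == 0, 1.0, fine)
    coarse = fine.max() / smax                      # one fp scale for all vectors
    s_q = np.clip(np.rint(fine / coarse), 1, smax)  # integer per-vector scales
    w_q = np.clip(np.rint(w / (s_q * coarse)), -qmax, qmax)
    return w_q.astype(np.int8), s_q.astype(np.int32), coarse

def per_vector_dequantize(w_q, s_q, coarse):
    return w_q * s_q * coarse

if __name__ == "__main__":
    w = np.random.randn(1024, 64).astype(np.float32)
    w_q, s_q, coarse = per_vector_quantize(w.reshape(-1))
    err = np.abs(per_vector_dequantize(w_q, s_q, coarse) - w).max()
    print("max abs error:", err)
```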

DAQ: Channel-wise distribution-aware quantization for deep image super-resolution networks

C Hong, H Kim, S Baik, J Oh… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Since the resurgence of deep neural networks (DNNs), image super-resolution (SR) has
recently seen huge progress in improving the quality of low-resolution images, however at …

PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

M Neseem, C McCullough, R Hsin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Low-precision quantization is recognized for its efficacy in neural network optimization. Our
analysis reveals that non-quantized elementwise operations which are prevalent in layers …

Pareto-optimal quantized ResNet is mostly 4-bit

AA Abdolrashidi, L Wang, S Agrawal… - Proceedings of the …, 2021 - openaccess.thecvf.com
Quantization has become a popular technique to compress neural networks and reduce
compute cost, but most prior work focuses on studying quantization without changing the …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …