Microscaling data formats for deep learning

BD Rouhani, R Zhao, A More, M Hall… - arXiv preprint arXiv …, 2023 - arxiv.org
Narrow bit-width data formats are key to reducing the computational and storage costs of
modern deep learning applications. This paper evaluates Microscaling (MX) data formats …
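
The entry above names block-scaled narrow formats but the snippet stops before explaining the mechanism. As a rough illustration only, the NumPy sketch below quantizes each block of 32 values against one shared power-of-two scale; the block size, element bit width, and scale encoding here are illustrative assumptions, not the published MX specification.

```python
# Minimal sketch of block-scaled quantization in the spirit of MX-style formats:
# each block of 32 values shares one power-of-two scale, and elements are stored
# as low-bit integers. Parameters are illustrative, not the MX spec.
import numpy as np

def quantize_blockwise(x, block=32, elem_bits=8):
    """Quantize a 1-D array with one shared power-of-two scale per block."""
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)        # [num_blocks, block]
    qmax = 2 ** (elem_bits - 1) - 1                    # e.g. 127 for 8-bit
    amax = np.abs(xb).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)
    # Shared scale: smallest power of two such that amax / scale <= qmax.
    exp = np.ceil(np.log2(amax / qmax))
    scale = 2.0 ** exp                                  # one scale per block
    q = np.clip(np.rint(xb / scale), -qmax, qmax)       # low-bit elements
    return q.astype(np.int32), scale

def dequantize_blockwise(q, scale, orig_len):
    return (q * scale).reshape(-1)[:orig_len]

if __name__ == "__main__":
    x = np.random.randn(1000).astype(np.float32)
    q, s = quantize_blockwise(x)
    err = np.abs(dequantize_blockwise(q, s, len(x)) - x).max()
    print("max abs reconstruction error:", err)
```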

RPTQ: Reorder-based post-training quantization for large language models

Z Yuan, L Niu, J Liu, W Liu, X Wang, Y Shang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale language models (LLMs) have demonstrated impressive performance, but their
deployment presents challenges due to their significant memory usage. This issue can be …

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

C Sakr, S Dai, R Venkatesan… - International …, 2022 - proceedings.mlr.press
Data clipping is crucial in reducing noise in quantization operations and improving the
achievable accuracy of quantization-aware training (QAT). Current practices rely on …
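
The snippet above points at the core trade-off: a clip value that is too small truncates large values, while one that is too large wastes quantizer resolution on rare outliers. The paper derives the optimum analytically; the brute-force MSE sweep below is only an assumption-based stand-in that illustrates why a tuned clip beats clipping at the maximum.

```python
# Illustration of the clipping/rounding trade-off in uniform quantization.
# The grid search over clip values is a simple stand-in for the paper's
# analytical solution, used here only to show the effect.
import numpy as np

def quantize(x, clip, bits=4):
    """Symmetric uniform quantization of x into [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    step = clip / levels
    q = np.clip(np.rint(x / step), -levels, levels)
    return q * step

def best_clip_by_mse(x, bits=4, num_candidates=200):
    """Pick the clip value minimizing quantization MSE over a grid."""
    top = np.abs(x).max()
    candidates = np.linspace(top / num_candidates, top, num_candidates)
    mses = [np.mean((quantize(x, c, bits) - x) ** 2) for c in candidates]
    return candidates[int(np.argmin(mses))]

if __name__ == "__main__":
    x = np.random.randn(100_000).astype(np.float32)
    naive = np.abs(x).max()                      # clip at the extreme value
    tuned = best_clip_by_mse(x, bits=4)
    for name, c in [("max-abs clip", naive), ("MSE-optimal clip", tuned)]:
        mse = np.mean((quantize(x, c, 4) - x) ** 2)
        print(f"{name}: clip={c:.3f}  mse={mse:.5f}")
```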

With shared microexponents, a little shifting goes a long way

B Darvish Rouhani, R Zhao, V Elango… - Proceedings of the 50th …, 2023 - dl.acm.org
This paper introduces Block Data Representations (BDR), a framework for exploring and
evaluating a wide spectrum of narrow-precision formats for deep learning. It enables …

Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control

CY Yen, S Abbasloo, HJ Chao - … of the ACM SIGCOMM 2023 Conference, 2023 - dl.acm.org
In this work, for the first time, we demonstrate that computers can automatically learn from
observing the heuristic efforts of the last four decades, stand on the shoulders of the existing …

A 95.6-TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm

B Keller, R Venkatesan, S Dai, SG Tell… - IEEE Journal of Solid …, 2023 - ieeexplore.ieee.org
The energy efficiency of deep neural network (DNN) inference can be improved with custom
accelerators. DNN inference accelerators often employ specialized hardware techniques to …
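
The title above names per-vector scaled 4-bit quantization as the key technique. As a hedged sketch only, the code below quantizes weights to 4-bit integers with one scale per small vector and then quantizes those per-vector scales against a single coarse floating-point scale; the vector length and scale bit width are illustrative assumptions, not the exact parameters of the accelerator in the paper.

```python
# Sketch of two-level per-vector scaled quantization: 4-bit weights, integer
# per-vector scales, one coarse floating-point scale. Parameters are assumed
# for illustration, not taken from the paper.
import numpy as np

def per_vector_quantize(w, vec=64, wbits=4, sbits=8):
    w = w.reshape(-1, vec)                          # [num_vectors, vec]
    qmax = 2 ** (wbits - 1) - 1                     # 7 for 4-bit weights
    smax = 2 ** sbits - 1                           # 255 for 8-bit scales
    fine = np.abs(w).max(axis=1, keepdims=True) / qmax   # per-vector scale
    fine = np.where(fine == 0, 1.0, fine)
    coarse = fine.max() / smax                      # one fp scale for all vectors
    s_q = np.clip(np.rint(fine / coarse), 1, smax)  # integer per-vector scales
    w_q = np.clip(np.rint(w / (s_q * coarse)), -qmax, qmax)
    return w_q.astype(np.int8), s_q.astype(np.int32), coarse

def per_vector_dequantize(w_q, s_q, coarse):
    return w_q * s_q * coarse

if __name__ == "__main__":
    w = np.random.randn(1024, 64).astype(np.float32)
    w_q, s_q, coarse = per_vector_quantize(w.reshape(-1))
    err = np.abs(per_vector_dequantize(w_q, s_q, coarse) - w).max()
    print("max abs error:", err)
```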

DAQ: Channel-wise distribution-aware quantization for deep image super-resolution networks

C Hong, H Kim, S Baik, J Oh… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Since the resurgence of deep neural networks (DNNs), image super-resolution (SR) has
recently seen huge progress in improving the quality of low-resolution images, however at …

PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

M Neseem, C McCullough, R Hsin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Low-precision quantization is recognized for its efficacy in neural network optimization. Our
analysis reveals that non-quantized elementwise operations which are prevalent in layers …

Pareto-optimal quantized ResNet is mostly 4-bit

AA Abdolrashidi, L Wang, S Agrawal… - Proceedings of the …, 2021 - openaccess.thecvf.com
Quantization has become a popular technique to compress neural networks and reduce
compute cost, but most prior work focuses on studying quantization without changing the …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …