VS-Quant: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

S. Dai, R. Venkatesan, M. Ren, … Proceedings of Machine Learning and Systems, 2021. proceedings.mlsys.org
Abstract
Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of (~16-64) elements within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 69% energy savings and 36% area savings over an 8-bit baseline while maintaining over 75% accuracy for ResNet50 on ImageNet. With 4-bit weights and 8-bit activations, we achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 28% compared to an 8-bit baseline.
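The two-level per-vector scaling described in the abstract can be made concrete with a small sketch. The NumPy code below is an illustrative approximation only, not the authors' implementation: the vector size, bitwidths, function name, and the way per-vector scales are rounded to low-bitwidth integers against a single coarse per-tensor scale are all assumptions for demonstration.

```python
import numpy as np

def vs_quant(x, vec_size=16, bits=4, scale_bits=4):
    """Illustrative per-vector scaled quantization with two-level scales.

    Elements of `x` are grouped into vectors of `vec_size` along the last
    dimension, each with its own scale factor. The per-vector scales are
    themselves quantized to `scale_bits`-bit integers and combined with a
    single floating-point scale for the whole tensor (the "two-level"
    idea). Names and defaults are assumptions for illustration.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit values
    smax = 2 ** scale_bits - 1            # e.g. 15 for 4-bit unsigned scales

    v = x.reshape(-1, vec_size)           # group elements into vectors

    # Level 1: ideal per-vector (floating-point) scale factors.
    s_fp = np.abs(v).max(axis=1, keepdims=True) / qmax
    s_fp = np.maximum(s_fp, 1e-12)        # avoid division by zero

    # Level 2: quantize the per-vector scales with one coarse
    # floating-point scale shared across the whole tensor.
    s_coarse = s_fp.max() / smax
    s_int = np.clip(np.round(s_fp / s_coarse), 1, smax)

    # Quantize elements with the reconstructed two-level scale.
    scale = s_int * s_coarse
    q = np.clip(np.round(v / scale), -qmax - 1, qmax)

    # Dequantize to inspect the error introduced by quantization.
    x_hat = (q * scale).reshape(x.shape)
    return q.astype(np.int8), s_int.astype(np.uint8), s_coarse, x_hat

# Example: quantize a random weight tensor and report the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, s_int, s_coarse, w_hat = vs_quant(w)
print("max abs error:", np.abs(w - w_hat).max())
```

Storing a small integer scale per vector plus one floating-point scale per tensor is what keeps the hardware overhead low relative to carrying a full floating-point scale for every vector.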