Efficiently scaling transformer inference

R Pope, S Douglas, A Chowdhery… - Proceedings of …, 2023 - proceedings.mlsys.org
We study the problem of efficient generative inference for Transformer models, in one of its
most challenging settings: large deep models, with tight latency targets and long sequence …

CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation

M Li, T Shi, C Ziems, MY Kan, NF Chen, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Annotated data plays a critical role in Natural Language Processing (NLP) for training
models and evaluating their performance. Given recent developments in Large Language …

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

C Sakr, S Dai, R Venkatesan… - International …, 2022 - proceedings.mlr.press
Data clipping is crucial in reducing noise in quantization operations and improving the
achievable accuracy of quantization-aware training (QAT). Current practices rely on …
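
The snippet above hinges on the trade-off between clipping distortion and rounding noise in quantization-aware training. As a rough illustration only, the NumPy sketch below shows symmetric fake-quantization with an explicit clipping threshold; the function name fake_quant_clipped and the example clip value are hypothetical and not the paper's specific method:

    import numpy as np

    def fake_quant_clipped(x, clip_val, num_bits=8):
        # Illustrative symmetric fake-quantization with a clipping threshold.
        # Values are clipped to [-clip_val, clip_val], mapped to a signed
        # integer grid, then dequantized back to float. Generic sketch of the
        # clipping step in QAT, not the paper's optimization of it.
        qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for signed 8-bit
        scale = clip_val / qmax               # step size of the uniform grid
        x_clipped = np.clip(x, -clip_val, clip_val)
        q = np.round(x_clipped / scale)       # integer code
        return q * scale                      # dequantized (float) value

    # A smaller clip_val lowers rounding noise inside the range but increases
    # clipping distortion on outliers; picking it well is what clipping
    # strategies for QAT are about.
    w = np.random.randn(4, 4).astype(np.float32)
    w_q = fake_quant_clipped(w, clip_val=2.5, num_bits=4)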

PokeBNN: A binary pursuit of lightweight accuracy

Y Zhang, Z Zhang, L Lew - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Optimization of Top-1 ImageNet accuracy promotes enormous networks that may be
impractical in inference settings. Binary neural networks (BNNs) have the potential to …

Understanding int4 quantization for language models: latency speedup, composability, and failure cases

X Wu, C Li, RY Aminabadi, Z Yao… - … Conference on Machine …, 2023 - proceedings.mlr.press
Improving the deployment efficiency of transformer-based language models has been
challenging given their high computation and memory cost. While INT8 quantization has …

Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases

X Wu, C Li, RY Aminabadi, Z Yao, Y He - arXiv preprint arXiv:2301.12017, 2023 - arxiv.org
Improving the deployment efficiency of transformer-based language models has been
challenging given their high computation and memory cost. While INT8 quantization has …
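
Both INT4 entries above concern pushing transformer weights to 4-bit precision. As a generic sketch only, assuming symmetric, per-output-channel, weight-only quantization (the helper names below are illustrative and not the recipe evaluated in these papers), the NumPy snippet maps weights to the signed 4-bit range and dequantizes them for use at matmul time:

    import numpy as np

    def quantize_int4_per_channel(w):
        # Illustrative symmetric per-output-channel INT4 weight quantization.
        # Each row of w gets its own scale so that its largest magnitude maps
        # to the INT4 limit (+/-7). Generic weight-only sketch, not the exact
        # scheme studied in the papers above.
        qmax = 7                                     # signed 4-bit range is [-8, 7]
        scales = np.abs(w).max(axis=1, keepdims=True) / qmax
        scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
        q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize(q, scales):
        return q.astype(np.float32) * scales

    w = np.random.randn(8, 16).astype(np.float32)
    q, s = quantize_int4_per_channel(w)
    w_hat = dequantize(q, s)                         # reconstruction used in the matmul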

4-bit conformer with native quantization aware training for speech recognition

S Ding, P Meadowlark, Y He, L Lew, S Agrawal… - arXiv preprint arXiv …, 2022 - arxiv.org
Reducing the latency and model size has always been a significant research problem for
live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model …

PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

M Neseem, C McCullough, R Hsin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Low-precision quantization is recognized for its efficacy in neural network optimization. Our
analysis reveals that non-quantized elementwise operations which are prevalent in layers …

Distill or annotate? Cost-efficient fine-tuning of compact models

J Kang, W Xu, A Ritter - arXiv preprint arXiv:2305.01645, 2023 - arxiv.org
Fine-tuning large models is highly effective; however, inference can be expensive and
produce carbon emissions. Knowledge distillation has been shown to be a practical …

2-bit conformer quantization for automatic speech recognition

O Rybakov, P Meadowlark, S Ding, D Qiu, J Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large speech models are rapidly gaining traction in the research community. As a result, model
compression has become an important topic, so that these models can fit in memory and be …