Efficiently scaling transformer inference

R Pope, S Douglas, A Chowdhery… - Proceedings of …, 2023 - proceedings.mlsys.org
We study the problem of efficient generative inference for Transformer models, in one of its
most challenging settings: large deep models, with tight latency targets and long sequence …

CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation

M Li, T Shi, C Ziems, MY Kan, NF Chen, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Annotated data plays a critical role in Natural Language Processing (NLP) for training
models and evaluating their performance. Given recent developments in Large Language …

Optimal clipping and magnitude-aware differentiation for improved quantization-aware training

C Sakr, S Dai, R Venkatesan… - International …, 2022 - proceedings.mlr.press
Data clipping is crucial in reducing noise in quantization operations and improving the
achievable accuracy of quantization-aware training (QAT). Current practices rely on …
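
The snippet above hinges on the trade-off between clipping distortion and rounding noise in quantization-aware training. As a rough illustration only, the NumPy sketch below shows symmetric fake-quantization with an explicit clipping threshold; the function name fake_quant_clipped and the example clip value are hypothetical and not the paper's specific method:

    import numpy as np

    def fake_quant_clipped(x, clip_val, num_bits=8):
        # Illustrative symmetric fake-quantization with a clipping threshold.
        # Values are clipped to [-clip_val, clip_val], mapped to a signed
        # integer grid, then dequantized back to float. Generic sketch of the
        # clipping step in QAT, not the paper's optimization of it.
        qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for signed 8-bit
        scale = clip_val / qmax               # step size of the uniform grid
        x_clipped = np.clip(x, -clip_val, clip_val)
        q = np.round(x_clipped / scale)       # integer code
        return q * scale                      # dequantized (float) value

    # A smaller clip_val lowers rounding noise inside the range but increases
    # clipping distortion on outliers; picking it well is what clipping
    # strategies for QAT are about.
    w = np.random.randn(4, 4).astype(np.float32)
    w_q = fake_quant_clipped(w, clip_val=2.5, num_bits=4)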

PokeBNN: A binary pursuit of lightweight accuracy

Y Zhang, Z Zhang, L Lew - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Optimization of Top-1 ImageNet accuracy promotes enormous networks that may be
impractical in inference settings. Binary neural networks (BNNs) have the potential to …

Understanding int4 quantization for language models: latency speedup, composability, and failure cases

X Wu, C Li, RY Aminabadi, Z Yao… - … Conference on Machine …, 2023 - proceedings.mlr.press
Improving the deployment efficiency of transformer-based language models has been
challenging given their high computation and memory cost. While INT8 quantization has …

Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases

X Wu, C Li, RY Aminabadi, Z Yao, Y He - arXiv preprint arXiv:2301.12017, 2023 - arxiv.org
Improving the deployment efficiency of transformer-based language models has been
challenging given their high computation and memory cost. While INT8 quantization has …
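
Both INT4 entries above concern pushing transformer weights to 4-bit precision. As a generic sketch only, assuming symmetric, per-output-channel, weight-only quantization (the helper names below are illustrative and not the recipe evaluated in these papers), the NumPy snippet maps weights to the signed 4-bit range and dequantizes them for use at matmul time:

    import numpy as np

    def quantize_int4_per_channel(w):
        # Illustrative symmetric per-output-channel INT4 weight quantization.
        # Each row of w gets its own scale so that its largest magnitude maps
        # to the INT4 limit (+/-7). Generic weight-only sketch, not the exact
        # scheme studied in the papers above.
        qmax = 7                                     # signed 4-bit range is [-8, 7]
        scales = np.abs(w).max(axis=1, keepdims=True) / qmax
        scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
        q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize(q, scales):
        return q.astype(np.float32) * scales

    w = np.random.randn(8, 16).astype(np.float32)
    q, s = quantize_int4_per_channel(w)
    w_hat = dequantize(q, s)                         # reconstruction used in the matmul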

4-bit conformer with native quantization aware training for speech recognition

S Ding, P Meadowlark, Y He, L Lew, S Agrawal… - arXiv preprint arXiv …, 2022 - arxiv.org
Reducing the latency and model size has always been a significant research problem for
live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model …

PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

M Neseem, C McCullough, R Hsin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Low-precision quantization is recognized for its efficacy in neural network optimization. Our
analysis reveals that non-quantized elementwise operations which are prevalent in layers …

Distill or annotate? Cost-efficient fine-tuning of compact models

J Kang, W Xu, A Ritter - arXiv preprint arXiv:2305.01645, 2023 - arxiv.org
Fine-tuning large models is highly effective; however, inference can be expensive and
produce carbon emissions. Knowledge distillation has been shown to be a practical …

2-bit conformer quantization for automatic speech recognition

O Rybakov, P Meadowlark, S Ding, D Qiu, J Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large speech models are rapidly gaining traction in the research community. As a result, model
compression has become an important topic, so that these models can fit in memory and be …