Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional …
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
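A minimal sketch of the kind of Int8 matrix multiplication described above, assuming simple row-wise absmax scaling; the function names are illustrative, and the actual procedure additionally handles outlier feature dimensions separately in 16-bit, which this toy omits.

import numpy as np

def absmax_quantize(x: np.ndarray, axis: int) -> tuple[np.ndarray, np.ndarray]:
    """Quantize float32 to int8 with a per-row/per-column absmax scale."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """C = A @ B computed in int8 with int32 accumulation, then dequantized."""
    qa, sa = absmax_quantize(a, axis=1)               # one scale per row of A
    qb, sb = absmax_quantize(b, axis=0)               # one scale per column of B
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # integer accumulation
    return acc.astype(np.float32) * sa * sb           # rescale back to float

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))      # small quantization error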
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the …
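A toy sketch of the incoherence-processing idea: conjugating the weight matrix by random orthogonal matrices spreads out large entries before rounding, and the rotation is undone around the quantized weight at inference. This is an assumption-laden illustration only; the method's actual adaptive rounding (which uses Hessian information) is not shown.

import numpy as np

def random_orthogonal(n: int, rng: np.random.Generator) -> np.ndarray:
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))                     # fix column signs

def quantize_rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Plain round-to-nearest uniform quantization."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
u, v = random_orthogonal(64, rng), random_orthogonal(64, rng)

w_inc = u @ w @ v.T                                    # incoherent proxy weight
w_hat = u.T @ quantize_rtn(w_inc) @ v                  # quantize, rotate back
print(np.linalg.norm(w_hat - w) / np.linalg.norm(w))   # relative error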
E Frantar, D Alistarh - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model …
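The one-shot/post-training setting is typically formalized layer by layer: given calibration inputs X, find a compressed weight Ŵ minimizing ||WX − ŴX||_F². The sketch below, under that assumption, evaluates this proxy loss for a plain round-to-nearest baseline; the method's greedy per-weight update rules are omitted.

import numpy as np

def layerwise_loss(w: np.ndarray, w_hat: np.ndarray, x: np.ndarray) -> float:
    """Squared output-reconstruction error on calibration inputs X."""
    return float(np.linalg.norm(w @ x - w_hat @ x) ** 2)

def round_to_nearest(w: np.ndarray, bits: int = 4) -> np.ndarray:
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 128))       # layer weight
x = rng.standard_normal((128, 256))      # calibration activations
print(layerwise_loss(w, round_to_nearest(w), x))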
This chapter surveys approaches to quantizing the numerical values in deep neural network computations, covering the advantages and disadvantages of current methods …
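Most of the methods such surveys cover build on uniform affine quantization: a real interval [min, max] is mapped onto b-bit integers via a scale and zero point, and dequantization inverts the map. A minimal self-contained sketch:

import numpy as np

def affine_quantize(x: np.ndarray, bits: int = 8):
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 2.0, 7, dtype=np.float32)
q, s, z = affine_quantize(x)
print(x)
print(affine_dequantize(q, s, z))        # close to x, up to rounding error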
S Kim, A Gholami, Z Yao… - International Conference on Machine Learning, 2021 - proceedings.mlr.press
Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference …
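Integer-only inference requires that even the rescaling between layers avoids floating point. One common trick (an assumption here, not necessarily this paper's exact formulation) is to approximate a real scale s by a dyadic number m / 2**k, turning requantization into an integer multiply plus a bit shift:

import numpy as np

def dyadic_approx(scale: float, k: int = 15) -> tuple[int, int]:
    """Approximate `scale` as m / 2**k with integer m."""
    return int(round(scale * (1 << k))), k

def requantize(acc: np.ndarray, scale: float) -> np.ndarray:
    """Rescale int32 accumulators to int8 using only integer ops."""
    m, k = dyadic_approx(scale)
    out = (acc.astype(np.int64) * m) >> k             # integer multiply-shift
    return np.clip(out, -128, 127).astype(np.int8)

acc = np.array([12000, -3400, 500], dtype=np.int32)   # int32 GEMM outputs
print(requantize(acc, scale=0.0042))                  # approximately acc * 0.0042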
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant …
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of …
The advent of transformers has brought about significant advancements in traditional machine learning tasks. However, their pervasive deployment has raised concerns about …