LLM-QAT: Data-free quantization aware training for large language models

Z Liu, B Oguz, C Zhao, E Chang, P Stock… - arXiv preprint arXiv …, 2023 - arxiv.org
Several post-training quantization methods have been applied to large language models
(LLMs), and have been shown to perform well down to 8-bits. We find that these methods …

OmniQuant: Omnidirectionally calibrated quantization for large language models

W Shao, M Chen, Z Zhang, P Xu, L Zhao, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have revolutionized natural language processing tasks.
However, their practical deployment is hindered by their immense memory and computation …

A comprehensive survey on model quantization for deep neural networks in image classification

B Rokh, A Azarpeyvand, A Khanteymoori - ACM Transactions on …, 2023 - dl.acm.org
Recent advancements in machine learning achieved by Deep Neural Networks (DNNs)
have been significant. While demonstrating high accuracy, DNNs are associated with a …

QLLM: Accurate and efficient low-bitwidth quantization for large language models

J Liu, R Gong, X Wei, Z Dong, J Cai… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread
deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive …

BiT: Robustly binarized multi-distilled transformer

Z Liu, B Oguz, A Pappu, L Xiao, S Yih… - Advances in neural …, 2022 - proceedings.neurips.cc
Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine
learning, but have also grown in parameters and computational complexity, making them …

Q-DETR: An efficient low-bit quantized detection transformer

S Xu, Y Li, M Lin, P Gao, G Guo… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent detection transformer (DETR) has advanced object detection, but its application
on resource-constrained devices requires massive computation and memory resources …

A survey on deep learning hardware accelerators for heterogeneous HPC platforms

C Silvano, D Ielmini, F Ferrandi, L Fiorin… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent trends in deep learning (DL) imposed hardware accelerators as the most viable
solution for several classes of high-performance computing (HPC) applications such as …

A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

Oscillation-free quantization for low-bit vision transformers

SY Liu, Z Liu, KT Cheng - International Conference on …, 2023 - proceedings.mlr.press
Weight oscillation is a by-product of quantization-aware training, in which quantized weights
frequently jump between two quantized levels, resulting in training instability and a sub …

IRGen: Generative modeling for image retrieval

Y Zhang, T Zhang, D Chen, Y Wang, Q Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
While generative modeling has been ubiquitous in natural language processing and
computer vision, its application to image retrieval remains unexplored. In this paper, we …