A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

FlexGen: High-throughput generative inference of large language models with a single GPU

Y Sheng, L Zheng, B Yuan, Z Li… - International …, 2023 - proceedings.mlr.press
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …
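FlexGen's core idea is to trade latency for throughput by streaming weights (and cached state) through a GPU/CPU/disk hierarchy, so a single commodity GPU only ever holds a slice of the model. A minimal sketch of that streaming pattern, with numpy arrays standing in for device transfers and a toy ReLU layer in place of a transformer block:

```python
import numpy as np

rng = np.random.default_rng(0)
# Host-side weight store standing in for the CPU/disk tiers; on a real
# system each fetch below would be a disk->CPU or CPU->GPU transfer.
host_weights = [rng.standard_normal((64, 64)) for _ in range(8)]

def infer_streaming(x, host_weights):
    """Keep only one layer's weights 'on device' at a time."""
    for W_host in host_weights:
        W = W_host.copy()           # stands in for the transfer to the accelerator
        x = np.maximum(x @ W, 0.0)  # run the layer (toy ReLU block)
        del W                       # evict before the next layer is fetched
    return x

y = infer_streaming(rng.standard_normal(64), host_weights)
```

The real system additionally searches over batch sizes and offloading policies and overlaps transfers with compute; the sketch only shows why peak accelerator memory can stay near a single layer's footprint.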

Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization

J Kim, JH Lee, S Kim, J Park, KM Yoo… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) face challenges in fine-tuning and deployment due to
their high memory demands and computational costs. While parameter-efficient fine-tuning …
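The approach (PEQA) quantizes weights to sub-4-bit integers and then fine-tunes only the quantization scales, keeping the frozen integer codes as the deployed weights. A minimal sketch of that setup, assuming plain round-to-nearest 3-bit quantization with one scale per output row (the paper's quantizer and grouping may differ):

```python
import numpy as np

def quantize_rows(W, bits=3):
    """Sub-4-bit round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
q, scale = quantize_rows(W)  # the integer codes q are frozen from here on
W_hat = q * scale            # dequantized weights used by the forward pass

# Fine-tuning then takes gradient steps on `scale` only: 8 trainable values
# here versus 128 frozen weights, which is where the memory saving comes from.
print(np.abs(W - W_hat).max())
```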

LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models

G Park, B Park, M Kim, S Lee, J Kim, B Kwon… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent advancements in self-supervised learning, combined with the Transformer
architecture, have enabled natural language processing (NLP) to achieve remarkably low …
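LUT-GEMM operates on weights stored in binary-coding quantization, W ≈ Σ_b α_b B_b with B_b ∈ {−1, +1}. Rather than dequantizing, it precomputes the dot products of short activation sub-vectors with all 2^μ sign patterns into lookup tables, so each weight group's bit pattern simply indexes a table and multiplications are replaced by lookups shared across all output columns. A minimal 1-bit numpy sketch (the group size MU and the per-column scale rule are illustrative):

```python
import numpy as np

MU = 4  # activations per lookup table; illustrative, not the paper's tuned value

def build_luts(x):
    """Precompute the dot product of every length-MU chunk of x with every
    sign pattern in {-1, +1}^MU: 2**MU table entries per chunk."""
    patterns = np.array([[1.0 if (p >> i) & 1 else -1.0 for i in range(MU)]
                         for p in range(2 ** MU)])        # (2^MU, MU)
    chunks = x.reshape(-1, MU)                            # (K/MU, MU)
    return chunks @ patterns.T                            # (K/MU, 2^MU)

def encode_signs(B):
    """Pack each MU-long {-1, +1} chunk of every weight column into a MU-bit
    table index, matching the bit order used in build_luts."""
    K, N = B.shape
    bits = (B.reshape(K // MU, MU, N) > 0).astype(np.int64)
    return (bits * (1 << np.arange(MU))[None, :, None]).sum(axis=1)  # (K/MU, N)

rng = np.random.default_rng(0)
K, N = 16, 8
W = rng.standard_normal((K, N))

# 1-bit binary-coding quantization: W ~= B * alpha, one scale per output column.
alpha = np.abs(W).mean(axis=0)
B = np.where(W >= 0, 1.0, -1.0)

x = rng.standard_normal(K)
luts = build_luts(x)                  # built once, shared by all N output columns
codes = encode_signs(B)               # stored in place of the full-precision W
y_lut = alpha * luts[np.arange(K // MU)[:, None], codes].sum(axis=0)

assert np.allclose(y_lut, x @ (B * alpha))  # matches the dequantized matvec
```

For multi-bit codes the same tables are reused once per bit plane, with each plane's lookups accumulated under its own scale α_b.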

LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

H Guo, P Greengard, EP Xing, Y Kim - arXiv preprint arXiv:2311.12023, 2023 - arxiv.org
We propose a simple approach for memory-efficient adaptation of pretrained language
models. Our approach uses an iterative algorithm to decompose each pretrained matrix into …
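Concretely, the target decomposition is W ≈ Q + UV with Q quantized and UV low-rank and high-precision, and the iteration alternates two easy subproblems: quantize the residual W − UV, then refit U, V from a truncated SVD of W − Q. A minimal numpy sketch, with a plain uniform quantizer standing in for the paper's data-aware one:

```python
import numpy as np

def quantize_uniform(A, bits=3):
    """Round-to-nearest uniform quantizer; a stand-in for the paper's quantizer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max() / qmax
    return np.clip(np.round(A / scale), -qmax, qmax) * scale

def lq_decompose(W, rank=8, bits=3, iters=10):
    """Alternate between quantizing the residual and refitting the low-rank
    factors so that W ~= Q + U @ V."""
    U = np.zeros((W.shape[0], rank))
    V = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_uniform(W - U @ V, bits)     # fix UV, quantize the residual
        u, s, vt = np.linalg.svd(W - Q, full_matrices=False)
        U, V = u[:, :rank] * s[:rank], vt[:rank]  # fix Q, best rank-r residual fit
    return Q, U, V

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
Q, U, V = lq_decompose(W)
print(np.linalg.norm(W - (Q + U @ V)) / np.linalg.norm(W))  # relative error
```

During finetuning, Q then stays frozen in low precision while only U and V receive gradients, as in LoRA.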

A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

NOLA: Networks as linear combination of low-rank random basis

SA Koohpayegani, KL Navaneet… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently gained popularity due to their impressive few-
shot performance across various downstream tasks. However, fine-tuning all parameters …
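NOLA's trick is to constrain the LoRA update to ΔW = (Σ_i α_i A_i)(Σ_j β_j B_j), where the A_i and B_j are frozen random matrices regenerable from a seed and only the scalar coefficients are trained, decoupling the number of trainable parameters from both the rank and the layer shape. A minimal sketch of the reparameterization (the dimensions and basis count k are illustrative):

```python
import numpy as np

def nola_delta(alphas, betas, A_basis, B_basis):
    """DeltaW = (sum_i alpha_i A_i) @ (sum_j beta_j B_j); only the scalar
    coefficients are trainable, the random bases stay frozen."""
    A = np.tensordot(alphas, A_basis, axes=1)  # (d, r)
    B = np.tensordot(betas, B_basis, axes=1)   # (r, d)
    return A @ B

d, r, k = 32, 4, 16
rng = np.random.default_rng(42)            # the seed stands in for storing the bases
A_basis = rng.standard_normal((k, d, r))   # frozen, never updated
B_basis = rng.standard_normal((k, r, d))
alphas = 0.01 * rng.standard_normal(k)     # trainable: 2k = 32 scalars in total,
betas = 0.01 * rng.standard_normal(k)      # versus 2*d*r = 256 for plain LoRA
dW = nola_delta(alphas, betas, A_basis, B_basis)
```

Checkpointing a finetuned layer then requires only the seed and the 2k coefficients.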

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

HexGen: Generative inference of foundation model over heterogeneous decentralized environment

Y Jiang, R Yan, X Yao, B Chen, B Yuan - arXiv preprint arXiv:2311.11514, 2023 - arxiv.org
Serving foundation model inference is a pivotal component of contemporary AI applications;
such a service is usually hosted in a centralized data center on a group of homogeneous …

Enhancing computation efficiency in large language models through weight and activation quantization

J Lee, M Kim, S Baek, SJ Hwang, W Sung… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are proficient in natural language processing tasks, but
their deployment is often restricted by extensive parameter sizes and computational …
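The generic recipe behind joint weight-and-activation quantization is to run the matrix multiply in low-precision integers and rescale the result, typically with one scale per activation row (token) and one per weight column (output channel). A minimal symmetric-int8 numpy sketch of that pattern (the W8A8 baseline, not necessarily this paper's specific sub-8-bit scheme):

```python
import numpy as np

def quantize_sym(A, axis, bits=8):
    """Symmetric round-to-nearest quantization, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero slices
    return np.clip(np.round(A / scale), -qmax, qmax).astype(np.int32), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64))   # activations: one scale per token (row)
W = rng.standard_normal((64, 32))  # weights: one scale per output channel

Xq, sx = quantize_sym(X, axis=1)   # (4, 64) int32 codes, (4, 1) scales
Wq, sw = quantize_sym(W, axis=0)   # (64, 32) int32 codes, (1, 32) scales
Y = (Xq @ Wq).astype(np.float64) * sx * sw  # integer GEMM, then rescale
print(np.abs(Y - X @ W).max())     # small quantization error
```

In practice the hard part is the activation side, where rare large-magnitude outliers inflate the per-token scales; handling them is the focus of work in this line.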