A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

FlexGen: High-throughput generative inference of large language models with a single GPU

Y Sheng, L Zheng, B Yuan, Z Li… - International …, 2023 - proceedings.mlr.press
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …
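FlexGen's core idea is to trade latency for throughput by streaming weights (and cached state) through a GPU/CPU/disk hierarchy, so a single commodity GPU only ever holds a slice of the model. A minimal sketch of that streaming pattern, with numpy arrays standing in for device transfers and a toy ReLU layer in place of a transformer block:

```python
import numpy as np

rng = np.random.default_rng(0)
# Host-side weight store standing in for the CPU/disk tiers; on a real
# system each fetch below would be a disk->CPU or CPU->GPU transfer.
host_weights = [rng.standard_normal((64, 64)) for _ in range(8)]

def infer_streaming(x, host_weights):
    """Keep only one layer's weights 'on device' at a time."""
    for W_host in host_weights:
        W = W_host.copy()           # stands in for the transfer to the accelerator
        x = np.maximum(x @ W, 0.0)  # run the layer (toy ReLU block)
        del W                       # evict before the next layer is fetched
    return x

y = infer_streaming(rng.standard_normal(64), host_weights)
```

The real system additionally searches over batch sizes and offloading policies and overlaps transfers with compute; the sketch only shows why peak accelerator memory can stay near a single layer's footprint.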

Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization

J Kim, JH Lee, S Kim, J Park, KM Yoo… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) face challenges in fine-tuning and deployment due to
their high memory demands and computational costs. While parameter-efficient fine-tuning …
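The approach (PEQA) quantizes weights to sub-4-bit integers and then fine-tunes only the quantization scales, keeping the frozen integer codes as the deployed weights. A minimal sketch of that setup, assuming plain round-to-nearest 3-bit quantization with one scale per output row (the paper's quantizer and grouping may differ):

```python
import numpy as np

def quantize_rows(W, bits=3):
    """Sub-4-bit round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
q, scale = quantize_rows(W)  # the integer codes q are frozen from here on
W_hat = q * scale            # dequantized weights used by the forward pass

# Fine-tuning then takes gradient steps on `scale` only: 8 trainable values
# here versus 128 frozen weights, which is where the memory saving comes from.
print(np.abs(W - W_hat).max())
```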

LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models

G Park, B Park, M Kim, S Lee, J Kim, B Kwon… - arXiv preprint arXiv …, 2022 - arxiv.org
The recent advancements in self-supervised learning, combined with the Transformer
architecture, have enabled natural language processing (NLP) to achieve remarkably low …
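LUT-GEMM operates on weights stored in binary-coding quantization, W ≈ Σ_b α_b B_b with B_b ∈ {−1, +1}. Rather than dequantizing, it precomputes the dot products of short activation sub-vectors with all 2^μ sign patterns into lookup tables, so each weight group's bit pattern simply indexes a table and multiplications are replaced by lookups shared across all output columns. A minimal 1-bit numpy sketch (the group size MU and the per-column scale rule are illustrative):

```python
import numpy as np

MU = 4  # activations per lookup table; illustrative, not the paper's tuned value

def build_luts(x):
    """Precompute the dot product of every length-MU chunk of x with every
    sign pattern in {-1, +1}^MU: 2**MU table entries per chunk."""
    patterns = np.array([[1.0 if (p >> i) & 1 else -1.0 for i in range(MU)]
                         for p in range(2 ** MU)])        # (2^MU, MU)
    chunks = x.reshape(-1, MU)                            # (K/MU, MU)
    return chunks @ patterns.T                            # (K/MU, 2^MU)

def encode_signs(B):
    """Pack each MU-long {-1, +1} chunk of every weight column into a MU-bit
    table index, matching the bit order used in build_luts."""
    K, N = B.shape
    bits = (B.reshape(K // MU, MU, N) > 0).astype(np.int64)
    return (bits * (1 << np.arange(MU))[None, :, None]).sum(axis=1)  # (K/MU, N)

rng = np.random.default_rng(0)
K, N = 16, 8
W = rng.standard_normal((K, N))

# 1-bit binary-coding quantization: W ~= B * alpha, one scale per output column.
alpha = np.abs(W).mean(axis=0)
B = np.where(W >= 0, 1.0, -1.0)

x = rng.standard_normal(K)
luts = build_luts(x)                  # built once, shared by all N output columns
codes = encode_signs(B)               # stored in place of the full-precision W
y_lut = alpha * luts[np.arange(K // MU)[:, None], codes].sum(axis=0)

assert np.allclose(y_lut, x @ (B * alpha))  # matches the dequantized matvec
```

For multi-bit codes the same tables are reused once per bit plane, with each plane's lookups accumulated under its own scale α_b.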

LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

H Guo, P Greengard, EP Xing, Y Kim - arXiv preprint arXiv:2311.12023, 2023 - arxiv.org
We propose a simple approach for memory-efficient adaptation of pretrained language
models. Our approach uses an iterative algorithm to decompose each pretrained matrix into …
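Concretely, the target decomposition is W ≈ Q + UV with Q quantized and UV low-rank and high-precision, and the iteration alternates two easy subproblems: quantize the residual W − UV, then refit U, V from a truncated SVD of W − Q. A minimal numpy sketch, with a plain uniform quantizer standing in for the paper's data-aware one:

```python
import numpy as np

def quantize_uniform(A, bits=3):
    """Round-to-nearest uniform quantizer; a stand-in for the paper's quantizer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max() / qmax
    return np.clip(np.round(A / scale), -qmax, qmax) * scale

def lq_decompose(W, rank=8, bits=3, iters=10):
    """Alternate between quantizing the residual and refitting the low-rank
    factors so that W ~= Q + U @ V."""
    U = np.zeros((W.shape[0], rank))
    V = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_uniform(W - U @ V, bits)     # fix UV, quantize the residual
        u, s, vt = np.linalg.svd(W - Q, full_matrices=False)
        U, V = u[:, :rank] * s[:rank], vt[:rank]  # fix Q, best rank-r residual fit
    return Q, U, V

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
Q, U, V = lq_decompose(W)
print(np.linalg.norm(W - (Q + U @ V)) / np.linalg.norm(W))  # relative error
```

During finetuning, Q then stays frozen in low precision while only U and V receive gradients, as in LoRA.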

A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

NOLA: Networks as linear combination of low-rank random basis

SA Koohpayegani, KL Navaneet… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently gained popularity due to their impressive few-
shot performance across various downstream tasks. However, fine-tuning all parameters …
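NOLA's trick is to constrain the LoRA update to ΔW = (Σ_i α_i A_i)(Σ_j β_j B_j), where the A_i and B_j are frozen random matrices regenerable from a seed and only the scalar coefficients are trained, decoupling the number of trainable parameters from both the rank and the layer shape. A minimal sketch of the reparameterization (the dimensions and basis count k are illustrative):

```python
import numpy as np

def nola_delta(alphas, betas, A_basis, B_basis):
    """DeltaW = (sum_i alpha_i A_i) @ (sum_j beta_j B_j); only the scalar
    coefficients are trainable, the random bases stay frozen."""
    A = np.tensordot(alphas, A_basis, axes=1)  # (d, r)
    B = np.tensordot(betas, B_basis, axes=1)   # (r, d)
    return A @ B

d, r, k = 32, 4, 16
rng = np.random.default_rng(42)            # the seed stands in for storing the bases
A_basis = rng.standard_normal((k, d, r))   # frozen, never updated
B_basis = rng.standard_normal((k, r, d))
alphas = 0.01 * rng.standard_normal(k)     # trainable: 2k = 32 scalars in total,
betas = 0.01 * rng.standard_normal(k)      # versus 2*d*r = 256 for plain LoRA
dW = nola_delta(alphas, betas, A_basis, B_basis)
```

Checkpointing a finetuned layer then requires only the seed and the 2k coefficients.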

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

HexGen: Generative inference of foundation model over heterogeneous decentralized environment

Y Jiang, R Yan, X Yao, B Chen, B Yuan - arXiv preprint arXiv:2311.11514, 2023 - arxiv.org
Serving foundation model inference is a pivotal component of contemporary AI applications;
such a service is usually hosted in a centralized data center on a group of homogeneous …

Enhancing computation efficiency in large language models through weight and activation quantization

J Lee, M Kim, S Baek, SJ Hwang, W Sung… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are proficient in natural language processing tasks, but
their deployment is often restricted by extensive parameter sizes and computational …
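The generic recipe behind joint weight-and-activation quantization is to run the matrix multiply in low-precision integers and rescale the result, typically with one scale per activation row (token) and one per weight column (output channel). A minimal symmetric-int8 numpy sketch of that pattern (the W8A8 baseline, not necessarily this paper's specific sub-8-bit scheme):

```python
import numpy as np

def quantize_sym(A, axis, bits=8):
    """Symmetric round-to-nearest quantization, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero slices
    return np.clip(np.round(A / scale), -qmax, qmax).astype(np.int32), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64))   # activations: one scale per token (row)
W = rng.standard_normal((64, 32))  # weights: one scale per output channel

Xq, sx = quantize_sym(X, axis=1)   # (4, 64) int32 codes, (4, 1) scales
Wq, sw = quantize_sym(W, axis=0)   # (64, 32) int32 codes, (1, 32) scales
Y = (Xq @ Wq).astype(np.float64) * sx * sw  # integer GEMM, then rescale
print(np.abs(Y - X @ W).max())     # small quantization error
```

In practice the hard part is the activation side, where rare large-magnitude outliers inflate the per-token scales; handling them is the focus of work in this line.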