A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have transformed natural language processing
tasks successfully. Yet, their large size and high computational needs pose challenges for …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

A survey of low-bit large language models: Basics, systems, and algorithms

R Gong, Y Ding, Z Wang, C Lv, X Zheng, J Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …

I-LLM: Efficient integer-only inference for fully-quantized low-bit large language models

X Hu, Y Cheng, D Yang, Z Yuan, J Yu, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of
large language models (LLMs). Nonetheless, existing works still necessitate a considerable …

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

Y Shang, B Xu, W Kang, M Cai, Y Li, Z Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in Large Language Models (LLMs) inspire various strategies for integrating
video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface …

TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

H Sun, Z Chen, X Yang, Y Tian, B Chen - arXiv preprint arXiv:2404.11912, 2024 - arxiv.org
With large language models (LLMs) now widely deployed for long content generation,
there is an increasing demand for efficient long-sequence inference support …

A survey of small language models

C Van Nguyen, X Shen, R Aponte, Y Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Small Language Models (SLMs) have become increasingly important due to their efficiency
and their ability to perform various language tasks with minimal computational resources …

Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis

Y Fu - arXiv preprint arXiv:2405.08944, 2024 - arxiv.org
Transformer-based long context generative models power emerging AI applications like
hour-long video understanding and project-level coding agents. Deploying long context …

ShadowKV: KV cache in shadows for high-throughput long-context LLM inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …

Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview

Y Wang, T Yang, X Liang, G Wang, H Lu, X Zhe… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper provides a comprehensive overview of the principles, challenges, and
methodologies associated with quantizing large-scale neural network models. As neural …