KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

J Yuan, H Liu, S Zhong, YN Chuang, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Long context capability is a crucial competency for large language models (LLMs) as it
mitigates the human struggle to digest long-form texts. This capability enables complex task …

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

Y Shang, B Xu, W Kang, M Cai, Y Li, Z Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancements in Large Language Models (LLMs) inspire various strategies for integrating
video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface …

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Q Zhu, J Duan, C Chen, S Liu, X Li, G Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) now support extremely long context windows, but the
quadratic complexity of vanilla attention results in significantly long Time-to-First-Token …

Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview

Y Wang, T Yang, X Liang, G Wang, H Lu, X Zhe… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper provides a comprehensive overview of the principles, challenges, and
methodologies associated with quantizing large-scale neural network models. As neural …

BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

J Zhao, Z Fang, S Li, S Yang, S He - arXiv preprint arXiv:2410.23079, 2024 - arxiv.org
Large language models (LLMs) are essential in natural language processing but often
struggle with inference speed and computational efficiency, limiting real-time deployment …

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

C Fang, M Shi, R Geens, A Symons, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The widely used, weight-only quantized large language models (LLMs), which leverage low-
bit integer (INT) weights and retain floating-point (FP) activations, reduce storage …

A Survey on Large Language Model Acceleration based on KV Cache Management

H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …

IterGen: Iterative Structured LLM Generation

S Ugare, R Gumaste, T Suresh, G Singh… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are widely used for tasks such as natural language and
code generation. Still, their outputs often suffer from issues like privacy violations, and …

QET: Enhancing Quantized LLM Parameters and KV Cache Compression through Element Substitution and Residual Clustering

Y Wang, W Li, Z Yao, T Yang - arXiv preprint arXiv:2407.03637, 2024 - arxiv.org
Matrix quantization entails representing matrix elements in a more space-efficient form
to reduce storage usage, with dequantization restoring the original matrix for use. We …
