MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

A Liu, J Liu, Z Pan, Y He, G Haffari… - arXiv preprint arXiv …, 2024 - arxiv.org
A critical approach for efficiently deploying computationally demanding large language
models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of …
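The snippet only gets as far as saying that the KV cache stores key-value states. For readers unfamiliar with the mechanism, here is a minimal illustrative sketch (not code from the paper; the class name, shapes, and parameters are assumptions) of a per-layer KV cache that grows by one position per decoding step, which is the structure MiniCache aims to compress across the depth (layer) dimension:

```python
# Toy per-layer KV cache for autoregressive decoding (illustrative sketch;
# all names and shapes here are assumptions, not from the MiniCache paper).
import numpy as np

class KVCache:
    def __init__(self, num_layers, num_heads, head_dim):
        # One (keys, values) pair per transformer layer.
        self.keys = [np.empty((0, num_heads, head_dim)) for _ in range(num_layers)]
        self.values = [np.empty((0, num_heads, head_dim)) for _ in range(num_layers)]

    def append(self, layer, k, v):
        # Each decoding step adds one position's K/V states to every layer,
        # so memory scales with layers x sequence length x heads x head_dim.
        self.keys[layer] = np.concatenate([self.keys[layer], k[None]], axis=0)
        self.values[layer] = np.concatenate([self.values[layer], v[None]], axis=0)

cache = KVCache(num_layers=32, num_heads=32, head_dim=128)
cache.append(layer=0, k=np.zeros((32, 128)), v=np.zeros((32, 128)))
```

The per-layer factor in this sketch is the target of depth-wise compression: storage is duplicated across every layer, so merging or sharing states between adjacent layers reduces the cache roughly in proportion to the number of layers merged.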

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

H Dong, X Yang, Z Zhang, Z Wang, Y Chi… - arXiv preprint arXiv …, 2024 - arxiv.org
Many computational factors limit broader deployment of large language models. In this
paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a …
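To make the memory bottleneck the snippet refers to concrete, the following back-of-envelope calculation estimates KV cache size for a representative decoder-only model. It is an illustrative sketch, not from the paper; the model dimensions (32 layers, 32 heads, head dim 128, fp16) are assumptions:

```python
# Back-of-envelope KV cache memory estimate (illustrative; the parameter
# values below are assumptions, not taken from the LESS paper).
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; fp16 => 2 bytes per element.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 7B-class model (32 layers, 32 heads, head_dim 128) at 4K context.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # ~2.0 GiB per sequence, growing linearly with length
```

Because this footprint grows linearly with both sequence length and batch size, it can rival or exceed the model weights themselves at long contexts, which is the bottleneck that compressed or recurrent-style caches target.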