Layer-Condensed KV Cache for Efficient Inference of Large Language Models

H Wu, K Tu - arXiv preprint arXiv:2405.10637, 2024 - arxiv.org
Huge memory consumption has been a major bottleneck for deploying high-throughput
large language models in real-world applications. In addition to the large number of …

Towards understanding how attention mechanism works in deep learning

T Ruan, S Zhang - arXiv preprint arXiv:2412.18288, 2024 - arxiv.org
The attention mechanism has been extensively integrated into mainstream neural network
architectures, such as Transformers and graph attention networks. Yet, its underlying …