A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Towards unified deep image deraining: A survey and a new benchmark

X Chen, J Pan, J Dong, J Tang - arXiv preprint arXiv:2310.03535, 2023 - arxiv.org
Recent years have witnessed significant advances in image deraining, driven by a variety of
effective image priors and deep learning models. As each deraining approach has …

LLMLingua: Compressing prompts for accelerated inference of large language models

H Jiang, Q Wu, CY Lin, Y Yang, L Qiu - arXiv preprint arXiv:2310.05736, 2023 - arxiv.org
Large language models (LLMs) have been adopted in a wide range of applications due to their
astonishing capabilities. With advances in techniques such as chain-of-thought (CoT) …
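Note: the entry above concerns prompt compression. As a rough illustration of the general idea only (not LLMLingua's actual coarse-to-fine algorithm), the sketch below drops the prompt tokens that a small language model finds most predictable and keeps a target fraction. The per-token log-probabilities are assumed to come from some causal LM; the names and values in the example are made up.

```python
# Minimal sketch of perplexity-guided prompt compression (illustrative only;
# not the LLMLingua algorithm). Tokens that a small language model predicts
# easily are assumed to carry little information and are dropped first.

def compress_prompt(tokens, token_logprobs, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of tokens that are hardest to predict.

    tokens         : list[str]   -- prompt split into tokens
    token_logprobs : list[float] -- log p(token | prefix) from a small LM (assumed input)
    """
    assert len(tokens) == len(token_logprobs)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Lower log-prob => more surprising => more informative => keep.
    ranked = sorted(range(len(tokens)), key=lambda i: token_logprobs[i])
    kept = sorted(ranked[:n_keep])          # restore original token order
    return [tokens[i] for i in kept]

if __name__ == "__main__":
    toks = ["Please", "kindly", "summarize", "the", "following", "report", "now"]
    # Hypothetical per-token log-probs (higher = more predictable).
    lps = [-1.2, -0.3, -4.1, -0.2, -0.5, -3.8, -0.9]
    print(compress_prompt(toks, lps, keep_ratio=0.5))
```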

Evo-ViT: Slow-fast token evolution for dynamic vision transformer

Y Xu, Z Zhang, M Zhang, K Sheng, K Li… - Proceedings of the …, 2022 - ojs.aaai.org
Vision transformers (ViTs) have recently surged in popularity, but their huge
computational cost remains a severe issue. Since the computational complexity of ViT is …
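Note: the entry above addresses the cost of processing every ViT token in full. The sketch below shows a generic form of attention-guided token selection (keep the patch tokens the CLS token attends to most, fold the rest into one summary token), which is the broad idea behind slow-fast token handling but not Evo-ViT's exact update rule; all names and shapes are illustrative assumptions.

```python
# Generic sketch of attention-guided token selection for a ViT layer
# (illustrative; not Evo-ViT's exact slow-fast evolution procedure).
import numpy as np

def select_informative_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Split patch tokens into informative and placeholder groups.

    tokens   : (N, D) patch token embeddings (CLS token excluded)
    cls_attn : (N,)   attention weights from the CLS token to each patch token
    Returns the kept tokens plus one aggregated token summarizing the rest,
    so the expensive blocks see a shorter sequence.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-cls_attn)                # most-attended first
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    kept = tokens[keep_idx]
    if len(drop_idx) > 0:
        # Aggregate uninformative tokens into one attention-weighted summary token.
        w = cls_attn[drop_idx] / (cls_attn[drop_idx].sum() + 1e-8)
        summary = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, summary], axis=0)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toks = rng.normal(size=(196, 64))             # 14x14 patch tokens
    attn = rng.random(196); attn /= attn.sum()    # CLS attention over patches
    print(select_informative_tokens(toks, attn, keep_ratio=0.25).shape)  # (50, 64)
```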

Less is more: Focus attention for efficient DETR

D Zheng, W Dong, H Hu, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
DETR-like models have significantly boosted the performance of detectors and even
outperformed classical convolutional models. However, all tokens are treated equally …

The Optimal BERT Surgeon: Scalable and accurate second-order pruning for large language models

E Kurtic, D Campos, T Nguyen, E Frantar… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …
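Note: the title above refers to second-order pruning. The sketch below works through the classic optimal-brain-surgeon saliency score rho_i = w_i^2 / (2 [H^{-1}]_ii) under a diagonal Hessian approximation, which is the scoring rule such methods build on; it is not the paper's blocked, scalable solver, and the curvature estimates here are assumed inputs.

```python
# Sketch of second-order (optimal-brain-surgeon style) weight scoring with a
# diagonal Hessian approximation (illustrative; not the paper's blocked solver).
import numpy as np

def obs_saliency(weights, hessian_diag):
    """Saliency rho_i = w_i^2 / (2 * [H^{-1}]_ii). With a diagonal Hessian,
    [H^{-1}]_ii = 1 / H_ii, so rho_i = 0.5 * w_i^2 * H_ii.
    Weights with the smallest saliency are cheapest to remove."""
    return 0.5 * weights ** 2 * hessian_diag

def prune_by_saliency(weights, hessian_diag, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the lowest saliency."""
    scores = obs_saliency(weights, hessian_diag)
    k = int(len(weights) * sparsity)
    cut = np.partition(scores, k)[k]              # k-th smallest score
    mask = scores >= cut
    return weights * mask, mask

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=8)
    h = rng.random(8) + 0.1                       # assumed positive curvature estimates
    pruned, mask = prune_by_saliency(w, h, sparsity=0.5)
    print(mask, pruned)
```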

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer

X Chen, Z Liu, H Tang, L Yi… - Proceedings of the …, 2023 - openaccess.thecvf.com
High-resolution images enable neural networks to learn richer visual representations.
However, this improved performance comes at the cost of growing computational …

Model tells you what to discard: Adaptive KV cache compression for LLMs

S Ge, Y Zhang, L Liu, M Zhang, J Han, J Gao - arXiv preprint arXiv …, 2023 - arxiv.org
In this study, we introduce adaptive KV cache compression, a plug-and-play method that
reduces the memory footprint of generative inference for Large Language Models (LLMs) …
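Note: the snippet above describes a plug-and-play reduction of the KV cache's memory footprint. The sketch below shows the generic eviction idea (drop cached key/value entries for past tokens that have received little attention), not the paper's adaptive, per-head compression policy; the accumulated-attention statistic is an assumed input.

```python
# Minimal sketch of KV cache eviction by accumulated attention (illustrative;
# not the paper's adaptive per-head compression policy).
import numpy as np

def evict_kv(keys, values, attn_history, cache_budget):
    """Keep at most `cache_budget` past positions in the KV cache.

    keys, values : (T, D) cached keys/values for T past tokens
    attn_history : (T,)   accumulated attention each past token has received
    """
    T = keys.shape[0]
    if T <= cache_budget:
        return keys, values, attn_history
    # Keep the most-attended positions, preserving their original order.
    keep = np.sort(np.argsort(-attn_history)[:cache_budget])
    return keys[keep], values[keep], attn_history[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    K, V = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
    hist = rng.random(1000)
    K, V, hist = evict_kv(K, V, hist, cache_budget=256)
    print(K.shape, V.shape)   # (256, 64) (256, 64)
```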

Dynamic context pruning for efficient and interpretable autoregressive transformers

S Anagnostidis, D Pavllo, L Biggio… - Advances in …, 2024 - proceedings.neurips.cc
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their computational cost …