Large language model inference acceleration: A comprehensive hardware perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
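For context, speculative sampling lets a small draft model propose several tokens that the large target model then checks in one verification pass; below is a minimal greedy sketch, assuming placeholder `draft_model` and `target_model` callables that return next-token logits (this illustrates plain speculative decoding, not EAGLE-2's dynamic draft trees).

```python
import numpy as np

def speculative_step(prompt, draft_model, target_model, k=4):
    """One draft-then-verify step of (greedy) speculative decoding."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    tokens = list(prompt)
    proposed = []
    for _ in range(k):
        nxt = int(np.argmax(draft_model(tokens)))
        proposed.append(nxt)
        tokens.append(nxt)

    # Verify phase (assumed interface): the target model scores prompt + draft
    # in a single pass, returning one logit vector per position [len, vocab].
    logits = target_model(list(prompt) + proposed)
    accepted = []
    for i, tok in enumerate(proposed):
        target_choice = int(np.argmax(logits[len(prompt) - 1 + i]))
        if target_choice == tok:
            accepted.append(tok)                 # draft agrees with target
        else:
            accepted.append(target_choice)       # first mismatch: take target's token, stop
            break
    return accepted
```

Because the target model agrees with a whole prefix of the draft, several tokens can be committed per expensive forward pass, which is the source of the speedup.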

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
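The cost being targeted is the standard quadratic attention bottleneck; in the usual notation (a reminder, not taken from the paper):

```latex
\mathrm{Attn}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q,K,V\in\mathbb{R}^{n\times d},
```

so forming $QK^{\top}$ alone takes $O(n^{2}d)$ time and $O(n^{2})$ memory in the sequence length $n$, which is what almost-linear-time approximations aim to avoid.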

A tighter complexity analysis of SparseGPT

X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2408.12151, 2024 - arxiv.org
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
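Here $\omega$ is the square matrix-multiplication exponent and $\omega(1,1,a)$ the exponent for multiplying a $d\times d$ matrix by a $d\times d^{a}$ matrix (standard notation, assumed rather than quoted from the paper). As a sanity check, with naive multiplication ($\omega=3$, $\omega(1,1,a)=2+a$) the new bound collapses back to the old one for $a\le 1$:

```latex
d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}
  \;=\; d^{3} + d^{2+a+o(1)} + d^{3}
  \;=\; O\!\left(d^{3+o(1)}\right).
```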

LazyLLM: Dynamic token pruning for efficient long context LLM inference

Q Fu, M Cho, T Merth, S Mehta, M Rastegari… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
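The two stages described here can be sketched as follows; the hypothetical `model(tokens, kv_cache=...)` interface returning next-token logits plus an updated key/value cache is an assumption for illustration, not LazyLLM's API.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=32, eos_id=None):
    # 1) Prefill: run the whole prompt once, building the KV cache and
    #    producing the first generated token.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_id = int(np.argmax(logits))
    out = [next_id]

    # 2) Decode: feed one token at a time, reusing (and extending) the cache
    #    so earlier tokens are never recomputed.
    for _ in range(max_new_tokens - 1):
        if eos_id is not None and next_id == eos_id:
            break
        logits, kv_cache = model([next_id], kv_cache=kv_cache)
        next_id = int(np.argmax(logits))
        out.append(next_id)
    return out
```

The prefill stage dominates latency for long prompts, which is where pruning prompt tokens pays off.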

Advancing the understanding of fixed point iterations in deep neural networks: A detailed analytical study

Y Ke, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11279, 2024 - arxiv.org
Recent empirical studies have identified fixed point iteration phenomena in deep neural
networks, where the hidden state tends to stabilize after several layers, showing minimal …
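In standard fixed-point notation (assumed here, not quoted from the paper), the observed stabilization means the layer map $F$ acts almost as an identity beyond some depth:

```latex
h^{(\ell+1)} = F\!\left(h^{(\ell)}\right), \qquad
\bigl\|\,h^{(\ell+1)} - h^{(\ell)}\,\bigr\| \longrightarrow 0 \ \text{as } \ell \text{ grows},
```

so the hidden state approaches a point $h^{\ast}$ with $F(h^{\ast}) = h^{\ast}$, for instance when $F$ is a contraction on the relevant region.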

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao, Q Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …
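Mixture-of-depths, in its generic form, routes only a router-selected subset of tokens through each expensive block while the rest ride the residual path; a rough sketch with assumed names and shapes, not VideoLLM-MoD's vision-token routing:

```python
import numpy as np

def mod_layer(hidden, layer_fn, router_weights, capacity=0.25):
    """hidden: [num_tokens, dim] array; layer_fn: a full transformer block;
    router_weights: [dim] learned scoring vector (all names are assumptions)."""
    scores = hidden @ router_weights              # router logit per token
    k = max(1, int(capacity * hidden.shape[0]))   # tokens allowed into the block
    chosen = np.argsort(-scores)[:k]              # indices of the top-k tokens

    out = hidden.copy()                           # skipped tokens: residual path only
    out[chosen] = hidden[chosen] + layer_fn(hidden[chosen])  # processed tokens
    return out
```

Lowering `capacity` trades visual detail for compute, which is the dilemma the snippet describes.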

Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models

K Yao, P Gao, L Li, Y Zhao, X Wang, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for
adapting pre-trained Large Language Models (LLMs) to downstream tasks, primarily due to …
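As a reminder of what "parameter-efficient" means here, LoRA-style adapters freeze the pretrained weight and train only a low-rank update per layer; this is an illustrative sketch of that general idea, not the layer-wise importance method the paper proposes.

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8
W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01       # trainable low-rank factor
B = np.zeros((d_out, r))                  # trainable, zero-init so W is unchanged at start

def forward(x):
    # Effective weight is W + B @ A; only A and B (r*(d_in + d_out) params) train.
    return x @ (W + B @ A).T
```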

Apparate: Rethinking early exits to tame latency-throughput tensions in ML serving

Y Dai, R Pan, A Iyer, K Li, R Netravali - Proceedings of the ACM SIGOPS …, 2024 - dl.acm.org
Machine learning (ML) inference platforms are tasked with balancing two competing goals:
ensuring high throughput given many requests, and delivering low-latency responses to …
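Early exiting, in its generic form, attaches lightweight heads to intermediate layers and stops as soon as one is confident enough; a sketch under assumed interfaces (`layers`, `exit_heads`), not Apparate's ramp design or latency controller:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_forward(x, layers, exit_heads, threshold=0.9):
    """layers[i]: hidden -> hidden; exit_heads[i]: hidden -> class logits."""
    h = x
    for layer, head in zip(layers, exit_heads):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:       # confident enough: exit early
            return int(np.argmax(probs)), True
    return int(np.argmax(probs)), False    # fell through to the final layer
```

Easy inputs leave early (helping latency) while hard ones still use the full network (preserving accuracy), which is the tension the paper revisits.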

Layer swapping for zero-shot cross-lingual transfer in large language models

L Bandarkar, B Muller, P Yuvraj, R Hou… - arXiv preprint arXiv …, 2024 - arxiv.org
Model merging, such as model souping, is the practice of combining different models with
the same architecture together without further training. In this work, we present a model …
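Model souping in its simplest form is elementwise parameter averaging, and layer swapping replaces selected layers of one fine-tuned model with those of another; a sketch over plain `{name: array}` state dicts, with an illustrative selection predicate rather than the paper's recipe:

```python
import numpy as np

def soup(state_dicts):
    """Model souping: elementwise average of parameters, no further training."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

def layer_swap(base_sd, donor_sd, swap_if):
    """Layer swapping: take a parameter from `donor_sd` whenever the
    (illustrative) predicate `swap_if(name)` fires, else keep the base."""
    return {k: donor_sd[k] if swap_if(k) else base_sd[k] for k in base_sd}
```

Both operations require the models to share one architecture so that parameter names and shapes line up.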