Full stack optimization of transformer inference: a survey

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier

Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Cited by 65 · Related articles · All 6 versions

[PDF] arxiv.org

SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …

Cited by 159 · Related articles · All 4 versions

[PDF] neurips.cc

Speculative decoding with big little decoder

S Kim, K Mangalam, S Moon, J Malik… - Advances in …, 2024 - proceedings.neurips.cc

The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …

Cited by 65 · Related articles · All 5 versions

[PDF] arxiv.org

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org

In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Cited by 67 · Related articles · All 2 versions

[PDF] arxiv.org

The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey

S Pawar, SM Tonmoy, SM Zaman, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org

The advent of Large Language Models (LLMs) represents a notable breakthrough in Natural
Language Processing (NLP), contributing to substantial progress in both text …

Cited by 16 · Related articles · All 4 versions

[PDF] arxiv.org

ReLU strikes back: Exploiting activation sparsity in large language models

I Mirzadeh, K Alizadeh, S Mehta, CC Del Mundo… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Language Models (LLMs) with billions of parameters have drastically transformed AI
applications. However, their demanding computation during inference has raised significant …

Cited by 63 · Related articles · All 4 versions

[PDF] arxiv.org

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

Cited by 76 · Related articles · All 3 versions

[PDF] neurips.cc

Response length perception and sequence scheduling: An LLM-empowered LLM inference pipeline

Z Zheng, X Ren, F Xue, Y Luo… - Advances in Neural …, 2024 - proceedings.neurips.cc

Large language models (LLMs) have revolutionized the field of AI, demonstrating
unprecedented capacity across various tasks. However, the inference process for LLMs …

Cited by 44 · Related articles · All 7 versions

[PDF] usenix.org

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

H Xia, Z Zheng, X Wu, S Chen, Z Yao, S Youn… - 2024 USENIX Annual …, 2024 - usenix.org

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs)
and preserve the model quality consistently across varied applications. However, existing …

Cited by 6 · Related articles

[PDF] arxiv.org

LLMCad: Fast and scalable on-device large language model inference

D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative tasks, such as text generation and question answering, hold a crucial position in
the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a …

Cited by 45 · Related articles · All 2 versions
