SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification

X Miao, G Oliaro, Z Zhang, X Cheng, Z Wang… - arXiv preprint arXiv …, 2023 - cs.cmu.edu
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them quickly and cheaply. This paper introduces …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Beyond efficiency: A systematic survey of resource-efficient large language models

G Bai, Z Chai, C Ling, S Wang, J Lu, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated
models like OpenAI's ChatGPT, represents a significant advancement in artificial …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification

X Miao, G Oliaro, Z Zhang, X Cheng, Z Wang… - Proceedings of the 29th …, 2024 - dl.acm.org
This paper introduces SpecInfer, a system that accelerates generative large language model
(LLM) serving with tree-based speculative inference and verification. The key idea behind …

Toward sustainable GenAI using generation directives for carbon-friendly large language model inference

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2403.12900, 2024 - arxiv.org
The rapid advancement of Generative Artificial Intelligence (GenAI) across diverse sectors
raises significant environmental concerns, notably the carbon emissions from their cloud …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

J Duan, Z Song, X Miao, X Xi, D Lin, H Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …

Efficient training and inference: Techniques for large language models using llama

SR Cunningham, D Archambault, A Kung - Authorea Preprints, 2024 - techrxiv.org
Enhancing the efficiency of language models involves optimizing their training and
inference processes to reduce computational demands while maintaining high performance …