Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …