Authors
Archit Patke, Dhemath Reddy, Saurabh Jha, Christian Pinto, Haoran Qiu, Shengkun Cui, Chandra Narayanaswami, Zbigniew T Kalbarczyk, Ravishankar K Iyer
Publication date
2024/4/27
Conference
International Conference on Architectural Support for Programming Languages and Operating Systems
Description
The emergence of large language models (LLMs) has introduced excessive computational demands and unique execution patterns (i.e., nondeterministic execution time due to autoregressive generation) for cloud providers. Consequently, existing LLM serving systems lead to long request queues and fail to enforce request-serving service-level objectives (SLOs), because no effective way yet exists to translate the high-level SLOs to low-level LLM serving operations (LSOs), such as request eviction and GPU-CPU state swap. We introduce QLM, the first queue management system for multi-model LLM serving that maximizes SLO enforcement while achieving high throughput and utilization on heterogeneous devices. QLM (1) handles the non-determinism of incoming requests in the waiting queue by a highly explainable Bayesian statistical approach, and (2) reorders and assigns requests to devices (model …
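To make the queue-reordering idea in the abstract concrete, here is a minimal sketch of SLO-aware request ordering. It is an illustrative assumption, not QLM's actual algorithm: the Request class, estimate_service_time helper, and the earliest-slack-first policy are all hypothetical stand-ins for the paper's Bayesian estimator and reordering mechanism.

# Hypothetical sketch: order waiting requests by SLO deadline slack.
# All names here are illustrative; QLM's real design is in the paper.
from dataclasses import dataclass, field
import heapq
import time

@dataclass(order=True)
class Request:
    slack: float                                  # smaller slack = more urgent
    arrival: float = field(compare=False)
    slo_deadline: float = field(compare=False)    # absolute deadline (seconds)
    est_service: float = field(compare=False)     # estimated decode time (seconds)
    prompt: str = field(compare=False)

def estimate_service_time(prompt: str) -> float:
    # Crude stand-in: real systems must model the nondeterministic
    # autoregressive decode length statistically, as the abstract notes.
    return 0.01 * len(prompt.split())

def enqueue(queue: list, prompt: str, slo_seconds: float) -> None:
    now = time.monotonic()
    est = estimate_service_time(prompt)
    deadline = now + slo_seconds
    # Slack = time to spare before the SLO is violated if service starts now.
    heapq.heappush(queue, Request(deadline - now - est, now, deadline, est, prompt))

if __name__ == "__main__":
    q: list = []
    enqueue(q, "short prompt", slo_seconds=1.0)
    enqueue(q, "a much longer prompt " * 20, slo_seconds=1.0)
    while q:
        r = heapq.heappop(q)
        print(f"serving (slack={r.slack:.3f}s): {r.prompt[:30]!r}")

Under this toy policy, the longer request (with the same SLO but a larger service-time estimate) has less slack and is served first; the paper's contribution is making such decisions with a principled, explainable statistical model across multiple models and heterogeneous devices.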