On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

GPT-Neo with LoRA for better medical knowledge performance on the MultiMedQA dataset

J Blanco, C Lambert, O Thompson - 2024 - osf.io
The integration of Low-Rank Adaptation (LoRA) with the GPT-Neo model
significantly enhances its performance in medical knowledge tasks by leveraging the …
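
The snippet above names LoRA applied to GPT-Neo but does not show the configuration. As a minimal sketch of how such an adapter is commonly attached using the Hugging Face PEFT library, the rank, scaling factor, dropout, and target modules below are illustrative assumptions rather than the paper's reported settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Base model; EleutherAI/gpt-neo-1.3B is one publicly available GPT-Neo checkpoint.
base = "EleutherAI/gpt-neo-1.3B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA configuration: the rank, scaling, dropout, and choice of attention
# projections to adapt are assumptions for illustration, not the paper's values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projection names in GPT-Neo
)

# Wrap the base model so only the low-rank adapter weights are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter matrices are updated during fine-tuning, which keeps the trainable parameter count to a small fraction of the full model.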

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Z Yang, Y Yang, C Zhao, Q Guo, W He, W Ji - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the number of large language model (LLM) users, it is difficult for
bandwidth-constrained cloud servers to simultaneously process massive LLM services in …

Understanding GPU Architecture Implications on LLM Serving Workloads

Z Zhang - 2024 - research-collection.ethz.ch
Large language models (LLMs) have become a promising new technology. However,
the power of LLMs can only be unleashed with substantial computation. In this work, we first …