AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

S Choi, S Lee, Y Kim, J Park, Y Kwon… - 2022 USENIX Annual …, 2022 - usenix.org
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …

Tabi: An efficient multi-level inference system for large language models

Y Wang, K Chen, H Tan, K Guo - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Today's trend of building ever larger language models (LLMs), while pushing the
performance of natural language processing, adds significant latency to the inference stage …

S³: Increasing GPU Utilization during Generative Inference for Higher Throughput

Y Jin, CF Wu, D Brooks, GY Wei - Advances in Neural …, 2023 - proceedings.neurips.cc
Generating texts with a large language model (LLM) consumes massive amounts of
memory. Apart from the already-large model parameters, the key/value (KV) cache that …

Optimizing video analytics with declarative model relationships

F Romero, J Hauswald, A Partap, D Kang… - Proceedings of the …, 2022 - dl.acm.org
The availability of vast video collections and the accuracy of ML models has generated
significant interest in video analytics systems. Since naively processing all frames using …

Cloud-Native Computing: A Survey from the Perspective of Services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org
The development of cloud computing delivery models inspires the emergence of cloud-
native computing. Cloud-native computing, as the most influential development principle for …

EdgeAdaptor: Online configuration adaption, model selection and resource provisioning for edge DNN inference serving at scale

K Zhao, Z Zhou, X Chen, R Zhou… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The accelerating convergence of artificial intelligence and edge computing has sparked a
recent wave of interest in edge intelligence. While pilot efforts focused on edge DNN …

StepConf: SLO-aware dynamic resource configuration for serverless function workflows

Z Wen, Y Wang, F Liu - IEEE INFOCOM 2022-IEEE Conference …, 2022 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) offers a fine-grained resource provision model, enabling
developers to build highly elastic cloud applications. User requests are handled by a series …

NeuroScaler: Neural video enhancement at scale

H Yeo, H Lim, J Kim, Y Jung, J Ye, D Han - Proceedings of the ACM …, 2022 - dl.acm.org
High-definition live streaming has experienced tremendous growth. However, the video
quality of live video is often limited by the streamer's uplink bandwidth. Recently, neural …