SpotServe: Serving Generative Large Language Models on Preemptible Instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - arXiv preprint arXiv …, 2023 - arxiv.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

A Survey on Scheduling Techniques in Computing and Network Convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Graft: Efficient inference serving for hybrid deep learning with SLO guarantees via DNN re-alignment

J Wu, L Wang, Q Jin, F Liu - IEEE Transactions on Parallel and …, 2023 - ieeexplore.ieee.org
Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks,
yet their ever-increasing computational demands are hindering their deployment on …

AsyFunc: A high-performance and resource-efficient serverless inference system via asymmetric functions

Q Pei, Y Yuan, H Hu, Q Chen, F Liu - … of the 2023 ACM Symposium on …, 2023 - dl.acm.org
Recent advances in deep learning (DL) have spawned various intelligent cloud services
with well-trained DL models. Nevertheless, it is nontrivial to maintain the desired end-to-end …

Memory orchestration mechanisms in serverless computing: a taxonomy, review and future directions

Z Shojaee Rad, M Ghobaei-Arani, R Ahsan - Cluster Computing, 2024 - Springer
Serverless computing has become very popular in recent years due to its flexibility and cost
efficiency. Serverless computing is a cloud computing model that allows developers to write …

FaST-GShare: Enabling efficient spatio-temporal GPU sharing in serverless computing for deep learning inference

J Gu, Y Zhu, P Wang, M Chadha, M Gerndt - Proceedings of the 52nd …, 2023 - dl.acm.org
Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference
due to the ease of deployment and pay-per-use benefits. However, existing FaaS platforms …

Mira: A program-behavior-guided far memory system

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation

Z Hong, J Lin, S Guo, S Luo, W Chen… - Proceedings of the …, 2024 - dl.acm.org
Serverless ML inference is an emerging cloud computing paradigm for low-cost, easy-to-
manage inference services. In serverless ML inference, each call is executed in a container; …

Fine-Grained Management for Microservice Applications with Lazy Configuration Distribution

N Wang, L Wang, X Li, X Qin - Electronics, 2023 - mdpi.com
Service mesh is gaining popularity as a microservice architecture paradigm due to its
lightness, transparency, and scalability. However, fully releasing configurations to the data …

Rethinking Deployment for Serverless Functions: A Performance-First Perspective

Y Li, L Zhao, Y Yang, W Qu - … of the International Conference for High …, 2023 - dl.acm.org
Serverless computing commonly adopts strong isolation mechanisms for deploying
functions, which may bring significant performance overhead because each function needs …