Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Deferred continuous batching in resource-efficient large language model serving

Y He, Y Lu, G Alonso - Proceedings of the 4th Workshop on Machine …, 2024 - dl.acm.org
Although prior work on batched inference and parameter-efficient fine-tuning has reduced the resource requirements of large language models (LLMs), challenges …

LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - arXiv preprint arXiv:2404.09526, 2024 - arxiv.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

C Lin, Z Han, C Zhang, Y Yang, F Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of large language models (LLMs) has enabled LLM-based applications (aka AI
agents or co-pilots), a new software paradigm that combines the strength of LLM and …

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

J Bang, Y Choi, M Kim, Y Kim, M Rhu - arXiv preprint arXiv:2312.12391, 2023 - arxiv.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge the AI community is facing is how to train these large AI models in a cost …

MLTCP: Congestion Control for DNN Training

S Rajasekaran, S Narang, AA Zabreyko… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MLTCP, a technique to augment today's congestion control algorithms to
accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication …

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Y Zhong, S Liu, J Chen, J Hu, Y Zhu, X Liu, X Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
DistServe improves the performance of large language models (LLMs) serving by
disaggregating the prefill and decoding computation. Existing LLM serving systems colocate …

Training DNN Models over Heterogeneous Clusters with Optimal Performance

C Nie, J Maghakian, Z Liu - arXiv preprint arXiv:2402.05302, 2024 - arxiv.org
Adjusting batch sizes and adaptively tuning other hyperparameters can significantly speed
up deep neural network (DNN) training. Despite the ubiquity of heterogeneous clusters …

A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters

C Xue, W Cui, H Zhao, Q Chen, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Joint consideration of scheduling and adaptive parallelism offers great opportunities for
improving the training efficiency of large models on heterogeneous GPU clusters. However …

Asymptotically Optimal Scheduling of Multiple Parallelizable Job Classes

B Berg, B Moseley, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Many modern computing workloads are composed of parallelizable jobs. A single
parallelizable job can be completed more quickly if it is run on additional servers, however …