Text Generation aims to produce plausible and readable text in human language from input data. The resurgence of deep learning has greatly advanced this field, in particular, with the …
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache …
The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging …
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference …
GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models …
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client …
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
J Xu, W Zhou, Z Fu, H Zhou, L Li - arXiv preprint arXiv:2111.05193, 2021 - arxiv.org
In recent years, larger and deeper models are springing up and continuously pushing state-of-the-art (SOTA) results across various fields like natural language processing (NLP) and …