LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

C Guo, F Cheng, Z Du, J Kiessling, J Ku, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly transformed the
field of artificial intelligence, demonstrating remarkable capabilities in natural language …

ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference

X He, S Zhang, Y Wang, H Yin, Z Zeng, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …
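To make the "sparse MoE" notion in the entries above and below concrete, here is a minimal, generic sketch of top-k expert routing in Python. It is not taken from ExpertFlow or any other listed paper; the dimensions, the value of k, and the linear gate are arbitrary illustrative choices.

```python
# Minimal sketch of sparse top-k expert routing, the mechanism shared by the
# MoE systems surveyed in this listing. Generic illustration only; not the
# routing scheme of any specific paper above.
import numpy as np

def top_k_gating(x, gate_w, k=2):
    """Route one token `x` to k of E experts via a learned linear gate."""
    logits = x @ gate_w                       # (E,) gate scores for E experts
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the selected experts only
    return top, weights

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8                    # arbitrary toy sizes
x = rng.standard_normal(d_model)
gate_w = rng.standard_normal((d_model, n_experts))
experts, weights = top_k_gating(x, gate_w)
# Only k expert FFNs would run for this token; the other E-k stay idle, which
# is the source of both the compute savings and the deployment/memory issues
# these papers address.
print(experts, weights)
```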

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

Y Wei, J Du, J Jiang, X Shi, X Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org
Recently, the sparsely-gated Mixture-Of-Experts (MoE) architecture has garnered significant
attention. To benefit a wider audience, fine-tuning MoE models on more affordable clusters …

A Survey on Inference Optimization Techniques for Mixture of Experts Models

J Liu, P Tang, W Wang, Y Ren, X Hou, PA Heng… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of large-scale Mixture of Experts (MoE) models has marked a significant
advancement in artificial intelligence, offering enhanced model capacity and computational …

Special Session: Neuro-Symbolic Architecture Meets Large Language Models: A Memory-Centric Perspective

M Ibrahim, Z Wan, H Li, P Panda… - 2024 International …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have significantly transformed the landscape of artificial
intelligence, demonstrating exceptional capabilities in natural language understanding and …

DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

Y Zhang, S Aggarwal, T Mitra - arXiv preprint arXiv:2501.10375, 2025 - arxiv.org
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks,
face significant deployment challenges on memory-constrained devices. While GPUs offer …
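The offloading theme in this last entry can be illustrated with a short sketch: keep only a few experts resident on the GPU and fetch the rest from host memory on demand. This is a plain LRU baseline meant to show the transfer-vs-capacity trade-off, not DAOP's data-aware, predictive policy; `load_fn` is a hypothetical hook standing in for the host-to-GPU copy.

```python
# Minimal sketch of expert offloading with a small GPU-resident cache.
# Generic illustration of the offloading trade-off; NOT DAOP's method.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity, load_fn):
        self.capacity = capacity        # how many experts fit in GPU memory
        self.load_fn = load_fn          # hypothetical hook: copies expert weights host -> GPU
        self.resident = OrderedDict()   # expert_id -> weights, kept in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:            # hit: weights already on GPU
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:   # miss with full cache: evict LRU expert
            self.resident.popitem(last=False)
        weights = self.load_fn(expert_id)         # miss: pay a host-to-GPU transfer
        self.resident[expert_id] = weights
        return weights

# Toy usage: 8 experts, GPU room for 3; every miss stands in for the PCIe
# transfer that offloading and prefetching systems try to hide or avoid.
cache = ExpertCache(capacity=3, load_fn=lambda eid: f"weights_of_expert_{eid}")
for eid in [0, 1, 2, 0, 3, 1, 4]:
    cache.get(eid)
print(list(cache.resident))  # the 3 most recently used experts remain resident
```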