A Survey on Inference Optimization Techniques for Mixture of Experts Models

J Liu, P Tang, W Wang, Y Ren, X Hou, PA Heng… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of large-scale Mixture of Experts (MoE) models has marked a significant
advancement in artificial intelligence, offering enhanced model capacity and computational …
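The surveyed MoE systems build on sparse top-k expert routing, which the snippet only alludes to. Below is a minimal illustrative sketch of such a layer, written for this listing rather than taken from the survey; the class name `MoELayer`, the dimensions, and the `top_k` value are all assumptions chosen for the example.

```python
# Minimal sketch of a token-level top-k gated MoE layer (illustration only; not from the survey).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)   # router producing per-expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)       # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # each token is routed to k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():                      # only run the expert on its assigned tokens
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

tokens = torch.randn(16, 512)
print(MoELayer()(tokens).shape)   # torch.Size([16, 512])
```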

MTM: Rethinking Memory Profiling and Migration for Multi-Tiered Large Memory

J Ren, D Xu, J Ryu, K Shin, D Kim, D Li - Proceedings of the Nineteenth …, 2024 - dl.acm.org
Multi-terabyte large memory systems are often characterized by more than two memory tiers
with different latency and bandwidth. Multi-tiered large memory systems call for rethinking of …
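As a rough illustration of the profile-then-migrate idea behind tiered-memory management (not MTM's actual policy), the toy simulation below counts page accesses over a window, then promotes the hottest pages into a small fast tier and demotes the rest; the class name, capacity, and workload are invented for the example.

```python
# Toy hot/cold page migration across two memory tiers (illustration only; not MTM's algorithm).
from collections import Counter

class TieredMemory:
    def __init__(self, fast_capacity=4):
        self.fast_capacity = fast_capacity
        self.fast = set()              # page ids currently resident in the fast tier
        self.access_counts = Counter() # profiling data for the current window

    def access(self, page):
        self.access_counts[page] += 1

    def migrate(self):
        # Rank pages by profiled access count and keep only the hottest in the fast tier.
        hottest = {p for p, _ in self.access_counts.most_common(self.fast_capacity)}
        promoted = hottest - self.fast
        demoted = self.fast - hottest
        self.fast = hottest
        self.access_counts.clear()     # start a fresh profiling window
        return promoted, demoted

mem = TieredMemory(fast_capacity=2)
for page in [1, 1, 1, 2, 2, 3, 4, 1]:
    mem.access(page)
print(mem.migrate())   # ({1, 2}, set()): pages 1 and 2 promoted, nothing demoted yet
```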

FACET: On-the-Fly Activation Compression for Efficient Transformer Training

S Lee, G Yun, XT Nguyen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Training Transformer models, known for their outstanding performance in various tasks, can
be challenging due to extensive training times and substantial memory requirements. One …
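The memory pressure mentioned in the snippet comes largely from activations saved for the backward pass. The sketch below shows one generic way to compress them on the fly, quantizing a saved activation to int8 and decompressing it only when its gradient is needed; it is an assumed illustration of the general idea, not FACET's actual compression scheme.

```python
# Illustrative sketch of int8 compression of saved activations (not FACET's codec).
import torch

class CompressedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.clamp(min=0)
        # Store the activation as int8 plus a scale instead of the full-precision tensor.
        scale = y.abs().max() / 127.0 + 1e-12
        ctx.scale = scale
        ctx.save_for_backward((y / scale).round().to(torch.int8))
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y_q,) = ctx.saved_tensors
        y = y_q.to(grad_out.dtype) * ctx.scale   # decompress on demand (lossy)
        return grad_out * (y > 0)                # ReLU gradient mask

x = torch.randn(4, 8, requires_grad=True)
CompressedReLU.apply(x).sum().backward()
print(x.grad.shape)   # torch.Size([4, 8])
```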

LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control

Large language models (LLMs) have achieved remarkable success in various natural
language processing tasks. However, LLM inference is highly computational and memory …
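A performance-model-guided offloading system of this kind must decide how much of the model stays resident on the accelerator versus being streamed from host memory. The toy model below illustrates that decision with an analytical latency estimate; all constants, the function name `best_split`, and the overlap assumption are invented for this sketch and are not LM-Offload's actual performance model.

```python
# Toy performance model for choosing a layer-offloading split (illustration only).
def best_split(num_layers=32, layer_bytes=400e6, gpu_free_bytes=8e9,
               t_compute=2e-3, pcie_bw=16e9):
    """Pick how many layers stay resident on the GPU under a memory budget,
    estimating per-step latency when the remaining layers are streamed over PCIe."""
    best = None
    for resident in range(num_layers + 1):
        if resident * layer_bytes > gpu_free_bytes:
            break                                   # would exceed the GPU memory budget
        streamed = num_layers - resident
        transfer = streamed * layer_bytes / pcie_bw
        # Assume transfers overlap with compute, so step time is bounded by the slower of the two.
        latency = max(num_layers * t_compute, transfer)
        if best is None or latency < best[1]:
            best = (resident, latency)
    return best

print(best_split())   # (resident_layers, estimated_step_latency_in_seconds)
```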