Soft merging of experts with adaptive routing

M Muqeeth, H Liu, C Raffel - arXiv preprint arXiv:2306.03745, 2023 - arxiv.org
Sparsely activated neural networks with conditional computation learn to route their inputs
through different" expert" subnetworks, providing a form of modularity that densely activated …

Frequency decoupling for motion magnification via multi-level isomorphic architecture

F Wang, D Guo, K Li, Z Zhong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video Motion Magnification (VMM) aims to reveal subtle and imperceptible motion
information of objects in the macroscopic world. Prior methods directly model the motion …

FineQuant: Unlocking efficiency with fine-grained weight-only quantization for LLMs

YJ Kim, R Henry, R Fahim, HH Awadalla - arXiv preprint arXiv:2308.09723, 2023 - arxiv.org
Large Language Models (LLMs) have achieved state-of-the-art performance across various
language tasks but pose challenges for practical deployment due to their substantial …
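
For context, "fine-grained weight-only quantization" generally means storing weights in low precision with one scale per small group of values rather than per tensor or per row. The sketch below is an assumed group-wise symmetric INT8 round trip (the function names and the group size of 128 are illustrative, not FineQuant's algorithm or code).

```python
# Group-wise weight-only quantization sketch: int8 values plus one fp scale
# per contiguous group of `group_size` weights along each row.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128):
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 127.0      # per-group scale
    q = np.clip(np.round(groups / np.maximum(scales, 1e-12)), -127, 127)
    return q.astype(np.int8).reshape(rows, cols), scales.squeeze(-1)

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    rows, cols = q.shape
    groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(rows, cols)

# Round trip: finer groups track outlier weights better than one scale per row.
w = np.random.randn(4, 512).astype(np.float32)
q, s = quantize_groupwise(w)
mean_abs_error = np.abs(w - dequantize_groupwise(q, s)).mean()
```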

MoEC: Mixture of Expert Clusters

Y Xie, S Huang, T Chen, F Wei - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Sparse Mixture of Experts (MoE) has received great interest due to its promising
scaling capability with affordable computational overhead. MoE models convert dense …
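
To make the recurring "scaling capability with affordable computational overhead" claim concrete, the sketch below shows a generic top-k gate over a set of feed-forward experts (illustrative only; it is not MoEC's clustering method or code): total parameters grow with the number of experts, but each token runs only k of them, so per-token compute stays close to that of a dense layer.

```python
# Generic top-k sparsely activated MoE layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x)                              # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)              # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```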

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

S Shi, X Pan, Q Wang, C Liu, X Ren, Z Hu… - Proceedings of the …, 2024 - dl.acm.org
In recent years, large-scale models have been readily scaled to trillions of parameters with
sparsely activated mixture-of-experts (MoE), which significantly improves model quality …

MPMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

Z Zhang, Y Xia, H Wang, D Yang, C Hu… - … on Parallel and …, 2024 - ieeexplore.ieee.org
In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity
as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of …

Scaling neural machine translation to 200 languages

Nature, 2024 - nature.com
The development of neural techniques has opened up new avenues for research in
machine translation. Today, neural machine translation (NMT) systems can leverage highly …

Task-Based MoE for Multitask Multilingual Machine Translation

H Pham, YJ Kim, S Mukherjee, DP Woodruff… - arXiv preprint arXiv …, 2023 - arxiv.org
The Mixture-of-Experts (MoE) architecture has proven to be a powerful method for diverse tasks
when training deep models in many applications. However, current MoE implementations are …

Towards being parameter-efficient: A stratified sparsely activated transformer with dynamic capacity

H Xu, M Elbayad, K Murray, J Maillard… - arXiv preprint arXiv …, 2023 - arxiv.org
Mixture-of-experts (MoE) models that employ sparse activation have demonstrated
effectiveness in significantly increasing the number of parameters while maintaining low …