Soft merging of experts with adaptive routing

M Muqeeth, H Liu, C Raffel - arXiv preprint arXiv:2306.03745, 2023 - arxiv.org
Sparsely activated neural networks with conditional computation learn to route their inputs
through different" expert" subnetworks, providing a form of modularity that densely activated …

Frequency decoupling for motion magnification via multi-level isomorphic architecture

F Wang, D Guo, K Li, Z Zhong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video Motion Magnification (VMM) aims to reveal subtle and imperceptible motion
information of objects in the macroscopic world. Prior methods directly model the motion …

FineQuant: Unlocking efficiency with fine-grained weight-only quantization for LLMs

YJ Kim, R Henry, R Fahim, HH Awadalla - arXiv preprint arXiv:2308.09723, 2023 - arxiv.org
Large Language Models (LLMs) have achieved state-of-the-art performance across various
language tasks but pose challenges for practical deployment due to their substantial …
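
For context, "fine-grained weight-only quantization" generally means storing weights in low precision with one scale per small group of values rather than per tensor or per row. The sketch below is an assumed group-wise symmetric INT8 round trip (the function names and the group size of 128 are illustrative, not FineQuant's algorithm or code).

```python
# Group-wise weight-only quantization sketch: int8 values plus one fp scale
# per contiguous group of `group_size` weights along each row.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128):
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 127.0      # per-group scale
    q = np.clip(np.round(groups / np.maximum(scales, 1e-12)), -127, 127)
    return q.astype(np.int8).reshape(rows, cols), scales.squeeze(-1)

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    rows, cols = q.shape
    groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(rows, cols)

# Round trip: finer groups track outlier weights better than one scale per row.
w = np.random.randn(4, 512).astype(np.float32)
q, s = quantize_groupwise(w)
mean_abs_error = np.abs(w - dequantize_groupwise(q, s)).mean()
```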

MoEC: Mixture of Expert Clusters

Y Xie, S Huang, T Chen, F Wei - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Sparse Mixture of Experts (MoE) has received great interest due to its promising
scaling capability with affordable computational overhead. MoE models convert dense …
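
To make the recurring "scaling capability with affordable computational overhead" claim concrete, the sketch below shows a generic top-k gate over a set of feed-forward experts (illustrative only; it is not MoEC's clustering method or code): total parameters grow with the number of experts, but each token runs only k of them, so per-token compute stays close to that of a dense layer.

```python
# Generic top-k sparsely activated MoE layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x)                              # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)              # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```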

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

S Shi, X Pan, Q Wang, C Liu, X Ren, Z Hu… - Proceedings of the …, 2024 - dl.acm.org
In recent years, large-scale models have been readily scaled to trillions of parameters with
sparsely activated mixture-of-experts (MoE), which significantly improves model quality …

MPMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

Z Zhang, Y Xia, H Wang, D Yang, C Hu… - … on Parallel and …, 2024 - ieeexplore.ieee.org
In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity
as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of …

Scaling neural machine translation to 200 languages

Nature, 2024 - nature.com
The development of neural techniques has opened up new avenues for research in
machine translation. Today, neural machine translation (NMT) systems can leverage highly …

Task-Based MoE for Multitask Multilingual Machine Translation

H Pham, YJ Kim, S Mukherjee, DP Woodruff… - arXiv preprint arXiv …, 2023 - arxiv.org
The Mixture-of-Experts (MoE) architecture has proven to be a powerful method for diverse tasks
when training deep models in many applications. However, current MoE implementations are …

Towards being parameter-efficient: A stratified sparsely activated transformer with dynamic capacity

H Xu, M Elbayad, K Murray, J Maillard… - arXiv preprint arXiv …, 2023 - arxiv.org
Mixture-of-experts (MoE) models that employ sparse activation have demonstrated
effectiveness in significantly increasing the number of parameters while maintaining low …