M Ding, J Wu, X Dong, X Li, P Qin, T Gan… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation is a mainstream model-compression technique that transfers knowledge from a larger model (the teacher) to a smaller model (the student) to improve the …
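As an illustration of the general teacher-student transfer described in this snippet (not the cited paper's specific algorithm), below is a minimal sketch of the classic soft-label distillation loss; the temperature `T`, weighting `alpha`, and the toy tensors are assumptions for illustration only.

```python
# Minimal knowledge-distillation loss sketch (standard soft-label formulation),
# NOT the method of the cited paper; hyperparameters and shapes are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the student's temperature-softened distribution
    # to the teacher's, scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: random logits stand in for the student's and teacher's outputs.
student_logits = torch.randn(8, 10, requires_grad=True)  # small "student" model
teacher_logits = torch.randn(8, 10)                      # large "teacher" model
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student's logits
print(loss.item())
```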
Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods have not focused on learning the …