Y Gu, H Zhou, F Meng, J Zhou, M Huang - arXiv preprint arXiv:2410.17215, 2024 - arxiv.org
Knowledge distillation (KD) is widely used to train small, high-performing student language
models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training …
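Since the abstract is truncated, the paper's specific pre-training KD objective is not shown here. The following is only a minimal sketch of the standard knowledge-distillation loss the snippet alludes to (soft-target KL divergence between teacher and student token distributions), assuming a PyTorch setup; the function name, temperature, and mixing weight are illustrative assumptions, not the cited paper's method.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """Forward KL divergence between softened teacher and student distributions.

    Both logit tensors are assumed to have shape (batch, seq_len, vocab_size).
    """
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL(teacher || student): summed over sequence and vocabulary,
    # averaged over the batch dimension. The T^2 factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

# Typical usage: mix with the usual cross-entropy on ground-truth tokens,
# e.g. total_loss = ce_loss + alpha * kd_loss(student_logits, teacher_logits),
# where alpha is a hyperparameter (an assumption, not from the paper).
```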