T Wu, C Hou, S Lao, J Li, N Wong, Z Zhao… - arXiv preprint arXiv…, 2023 - arxiv.org
Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the …
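The alignment losses mentioned above typically follow the standard KD recipe, in which the student is trained to match the teacher's softened output distribution. Below is a minimal PyTorch sketch of one such loss (KL divergence on temperature-scaled logits); it is illustrative only and is not the specific loss proposed or compared in this paper, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_alignment_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label alignment loss: the student mimics the teacher's
    softened output distribution (standard logit distillation)."""
    # Soften both distributions with the temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage example with random logits (batch of 8, BERT vocabulary size 30522).
student_logits = torch.randn(8, 30522)
teacher_logits = torch.randn(8, 30522)
loss = kd_alignment_loss(student_logits, teacher_logits)
```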