P Lu, A Ghaddar, A Rashid, M Rezagholizadeh… - pdfs.semanticscholar.org
Abstract: Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language …
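The abstract refers to Knowledge Distillation as a compression technique. As a point of reference (not taken from this paper), the standard KD objective trains a small student to match the temperature-softened output distribution of a large teacher; the sketch below is a minimal, dependency-free illustration of that loss, with the `T**2` scaling conventionally applied to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(l / T for l in logits)  # subtract max for numerical stability
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: cross-entropy between the
    teacher's and student's temperature-softened distributions,
    scaled by T^2 (the usual convention in the KD literature)."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # soft student predictions
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In practice this soft-target term is combined with the ordinary hard-label cross-entropy, weighted by a mixing coefficient; the names and defaults above (`T=2.0`, `kd_loss`) are illustrative, not from the paper.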