Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv e-prints, 2021 - ui.adsabs.harvard.edu
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …