Distilled Dual-Encoder Model for Vision-Language Understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …

Distilled Dual-Encoder Model for Vision-Language Understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin… - Proceedings of the 2022 …, 2022 - aclanthology.org
On vision-language understanding (VLU) tasks, fusion-encoder vision-language models
achieve superior results but sacrifice efficiency because of the simultaneous encoding of …