YC Chen, L Li, L Yu, AE Kholy, F Ahmed, Z Gan… - arXiv preprint arXiv …, 2019 - arxiv.org
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …