C Chen, P Li, M Sun, Y Liu - … of the 2022 Conference on Empirical …, 2022 - aclanthology.org
Recently there has been an emerging interest in unsupervised vision-and-language pre-training (VLP) that learns multimodal representations without parallel image-caption data …
A referring expression comprehension (REC) system localizes objects in a given image based on a natural language expression. We propose a novel approach to …