This monograph surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches …
Abstract The Visual Question Answering (VQA) system is the process of finding useful information from images related to the question to answer the question correctly. It can be …
In this paper, a knowledge base graph embedding module is constructed to extend the versatility of knowledge-based VQA (Visual Question Answering) models. The knowledge …
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and …
Abstract Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene …
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge …
Z Cai, N Vasconcelos - … of the IEEE conference on computer …, 2018 - openaccess.thecvf.com
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector, trained with low IoU threshold, eg 0.5, usually produces …
Deep learning methods haverevolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in …