Cascade reasoning network for text-based visual question answering

F. Liu, G. Xu, Q. Wu, Q. Du, W. Jia, M. Tan
Proceedings of the 28th ACM International Conference on Multimedia, 2020. dl.acm.org
We study the problem of text-based visual question answering (T-VQA). Unlike general visual question answering (VQA), which only builds connections between questions and visual content, T-VQA requires reading and reasoning over both the text and the visual concepts that appear in images. The challenges of T-VQA mainly lie in three aspects: 1) it is difficult to understand the complex logic of a question and extract the specific information needed to answer it from rich image content; 2) text-related questions also involve visual concepts, yet cross-modal relationships between the text and the visual concepts are difficult to capture; 3) if the OCR (optical character recognition) system fails to detect the target text, the model receives no correct supervision signal, which makes training very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM treats multimodal information fusion as a stepwise encoding process and uses the previous attention results to guide the next fusion step. The MRG explicitly models the connections and interactions between the text and the visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task that trains the model with accurate supervision signals, thereby enhancing its reasoning ability in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with state-of-the-art methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.
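
The abstract describes the PAM only at a high level, so the following is a minimal PyTorch sketch of the stepwise guided-fusion idea it names, not the authors' implementation (that is in the GitHub repository above). All module and parameter names, feature dimensions, and the object-then-OCR ordering here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveAttentionStep(nn.Module):
    """One fusion step: attend over a set of features, conditioned on the
    question and on the attention summary produced by the previous step."""
    def __init__(self, q_dim, f_dim, hid_dim=512):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, hid_dim)      # project question embedding
        self.proj_f = nn.Linear(f_dim, hid_dim)      # project candidate features
        self.proj_prev = nn.Linear(hid_dim, hid_dim) # project previous summary
        self.score = nn.Linear(hid_dim, 1)           # per-candidate attention logit

    def forward(self, question, feats, prev_summary):
        # question: (B, q_dim); feats: (B, N, f_dim); prev_summary: (B, hid_dim)
        guide = self.proj_q(question) + self.proj_prev(prev_summary)   # (B, hid)
        keys = self.proj_f(feats)                                      # (B, N, hid)
        logits = self.score(torch.tanh(keys + guide.unsqueeze(1))).squeeze(-1)
        attn = F.softmax(logits, dim=-1)                               # (B, N)
        summary = torch.bmm(attn.unsqueeze(1), keys).squeeze(1)        # (B, hid)
        return summary, attn

class ProgressiveAttentionModule(nn.Module):
    """Cascade of attention steps: attend over visual objects first, then let
    that summary guide attention over OCR tokens, realizing the 'previous
    attention results guide the next fusion' idea from the abstract."""
    def __init__(self, q_dim, obj_dim, ocr_dim, hid_dim=512):
        super().__init__()
        self.obj_step = ProgressiveAttentionStep(q_dim, obj_dim, hid_dim)
        self.ocr_step = ProgressiveAttentionStep(q_dim, ocr_dim, hid_dim)
        self.hid_dim = hid_dim

    def forward(self, question, obj_feats, ocr_feats):
        prev = question.new_zeros(question.size(0), self.hid_dim)
        obj_sum, _ = self.obj_step(question, obj_feats, prev)
        ocr_sum, _ = self.ocr_step(question, ocr_feats, obj_sum)
        return torch.cat([obj_sum, ocr_sum], dim=-1)

# Example: batch of 2 questions, 36 object regions, 50 OCR tokens (all dims assumed).
pam = ProgressiveAttentionModule(q_dim=768, obj_dim=2048, ocr_dim=300)
out = pam(torch.randn(2, 768), torch.randn(2, 36, 2048), torch.randn(2, 50, 300))
print(out.shape)  # torch.Size([2, 1024])
```

The design point this sketch isolates is that the object-attention summary is fed into the OCR-attention step as its guiding state, so later fusion steps are conditioned on earlier attention results rather than computed independently.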