Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

M Liu, F Zhang, X Luo, F Liu, Y Wei, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Video question answering is an increasingly vital research field, spurred by the rapid
proliferation of video content online and the urgent need for intelligent systems that can …

Heterogeneous memory enhanced multimodal attention model for video question answering

C Fan, X Zhang, S Zhang, W Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com
In this paper, we propose a novel end-to-end trainable Video Question Answering
(VideoQA) framework with three major components: 1) a new heterogeneous memory which …

Language-aware Visual Semantic Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

Pairwise VLAD interaction network for video question answering

H Wang, D Guo, XS Hua, M Wang - Proceedings of the 29th ACM …, 2021 - dl.acm.org
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint
understanding of video and natural language question. Existing methods perform correlation …

VideoDistill: Language-aware Vision Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan, Y Zhao - arXiv preprint arXiv:2404.00973, 2024 - arxiv.org
Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

Language-Guided Visual Aggregation Network for Video Question Answering

X Liang, D Wang, Q Wang, B Wan, L An… - Proceedings of the 31st …, 2023 - dl.acm.org
Video Question Answering (VideoQA) aims to comprehend intricate relationships, actions,
and events within video content, as well as the inherent links between objects and scenes …

Bridge to answer: Structure-aware graph interaction network for video question answering

J Park, J Lee, K Sohn - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for
questions about a given video by leveraging adequate graph interactions of heterogeneous …

Redundancy-aware transformer for video question answering

Y Li, X Yang, A Zhang, C Feng, X Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically,
the current video encoders tend to holistically embed all video clues at different granularities …

Temporal pyramid transformer with multimodal interaction for video question answering

M Peng, C Wang, Y Gao, Y Shi, XD Zhou - arXiv preprint arXiv:2109.04735, 2021 - arxiv.org
Video question answering (VideoQA) is challenging given its multimodal combination of
visual understanding and natural language understanding. While existing approaches …

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

J Liang, X Meng, Y Wang, C Liu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of
multimedia processing, requiring intricate interactions between visual and textual modalities …