VideoDistill: Language-aware Vision Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan, Y Zhao - arXiv preprint arXiv:2404.00973, 2024 - arxiv.org
Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

Language-aware Visual Semantic Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

Encoding and Controlling Global Semantics for Long-form Video Question Answering

TT Nguyen, Z Hu, X Wu, CDT Nguyen, SK Ng… - arXiv preprint arXiv …, 2024 - arxiv.org
Seeking answers effectively for long videos is essential to build video question answering
(videoQA) systems. Previous methods adaptively select frames and regions from long …

Heterogeneous memory enhanced multimodal attention model for video question answering

C Fan, X Zhang, S Zhang, W Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com
In this paper, we propose a novel end-to-end trainable Video Question Answering
(VideoQA) framework with three major components: 1) a new heterogeneous memory which …

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

J Liang, X Meng, Y Wang, C Liu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of
multimedia processing, requiring intricate interactions between visual and textual modalities …

Harnessing Representative Spatial-Temporal Information for Video Question Answering

Y Wang, M Liu, X Song, L Nie - ACM Transactions on Multimedia Computing … - dl.acm.org
Video question answering, aiming to answer a natural language question related to the
given video, has become prevalent in the past few years. Although remarkable …

Compositional attention networks with two-stream fusion for video question answering

T Yu, J Yu, Z Yu, D Tao - IEEE Transactions on Image …, 2019 - ieeexplore.ieee.org
Given a video, Video Question Answering (VideoQA) aims at answering arbitrary free-form
questions about the video content in natural language. A successful VideoQA framework …

ERM: Energy-based refined-attention mechanism for video question answering

F Zhang, R Wang, F Zhou, Y Luo - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Spatiotemporal attention learning remains a challenging video question answering
(VideoQA) task as it requires a sufficient understanding of cross-modal spatiotemporal …

Language-Guided Visual Aggregation Network for Video Question Answering

X Liang, D Wang, Q Wang, B Wan, L An… - Proceedings of the 31st …, 2023 - dl.acm.org
Video Question Answering (VideoQA) aims to comprehend intricate relationships, actions,
and events within video content, as well as the inherent links between objects and scenes …