Multi-granularity relational attention network for audio-visual question answering

L Li, T Jin, W Lin, H Jiang, W Pan… - … on Circuits and …, 2023 - ieeexplore.ieee.org
Recent methods for video question answering (VideoQA), aiming to generate answers
based on given questions and video content, have made significant progress in cross-modal …

A universal quaternion hypergraph network for multimodal video question answering

Z Guo, J Zhao, L Jiao, X Liu, F Liu - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Fusion and interaction of multimodal features are essential for video question answering.
Structural information composed of the relationships between different objects in videos is …

Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

J Chung, Y Yu - arXiv preprint arXiv:2311.01233, 2023 - arxiv.org
Large language models such as GPT-3 have demonstrated an impressive capability to
adapt to new tasks without requiring task-specific training data. This capability has been …

Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering

M Peng, X Shao, Y Shi, X Zhou - ACM Transactions on Multimedia …, 2023 - dl.acm.org
Video question answering (VideoQA) is challenging as it requires reasoning about natural
language and multimodal interactive relations. Most existing methods apply attention …

Temporal pyramid transformer with multimodal interaction for video question answering

M Peng, C Wang, Y Gao, Y Shi, XD Zhou - arXiv preprint arXiv:2109.04735, 2021 - arxiv.org
Video question answering (VideoQA) is challenging given its multimodal combination of
visual understanding and natural language understanding. While existing approaches …

Multi-Semantic Alignment Co-Reasoning Network for Video Question Answering

M Peng, L Liu, Z Li, Y Shi, X Zhou - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Video question answering challenges models on understanding textual questions with
varying complexity and searching for clues from visual content with different hierarchical …

Triple Attention Network architecture for MovieQA

A Shah, TH Lin, S Wu - arXiv preprint arXiv:2111.09531, 2021 - arxiv.org
Movie question answering, or MovieQA, is a multimedia-related task wherein one is provided
with a video, the subtitle information, a question and candidate answers for it. The task is to …

Time-Evolving Conditional Character-centric Graphs for Movie Understanding

LH Dang, TM Le, V Le, TM Phuong, T Tran - NeurIPS 2022 Temporal … - openreview.net
Temporal graph structure learning for long-term human-centric video understanding is
promising but remains challenging due to the scarcity of dense graph annotations for long …