Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network

M Liu, F Zhang, X Luo, F Liu, Y Wei, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Video question answering is an increasingly vital research field, spurred by the rapid
proliferation of video content online and the urgent need for intelligent systems that can …

Cited by 2 Related articles All 2 versions

[PDF] thecvf.com

Heterogeneous memory enhanced multimodal attention model for video question answering

C Fan, X Zhang, S Zhang, W Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com

In this paper, we propose a novel end-to-end trainable Video Question Answering
(VideoQA) framework with three major components: 1) a new heterogeneous memory which …

Cited by 300 Related articles All 8 versions

[PDF] thecvf.com

Language-aware Visual Semantic Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan… - Proceedings of the …, 2024 - openaccess.thecvf.com

Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

[PDF] github.io

Pairwise VLAD interaction network for video question answering

H Wang, D Guo, XS Hua, M Wang - Proceedings of the 29th ACM …, 2021 - dl.acm.org

Video Question Answering (VideoQA) is a challenging problem, as it requires a joint
understanding of video and natural language question. Existing methods perform correlation …

Cited by 12 Related articles All 2 versions

[PDF] arxiv.org

VideoDistill: Language-aware Vision Distillation for Video Question Answering

B Zou, C Yang, Y Qiao, C Quan, Y Zhao - arXiv preprint arXiv:2404.00973, 2024 - arxiv.org

Significant advancements in video question answering (VideoQA) have been made thanks
to thriving large image-language pretraining frameworks. Although these image-language …

Language-Guided Visual Aggregation Network for Video Question Answering

X Liang, D Wang, Q Wang, B Wan, L An… - Proceedings of the 31st …, 2023 - dl.acm.org

Video Question Answering (VideoQA) aims to comprehend intricate relationships, actions,
and events within video content, as well as the inherent links between objects and scenes …

Cited by 1 Related articles

[PDF] thecvf.com

Bridge to answer: Structure-aware graph interaction network for video question answering

J Park, J Lee, K Sohn - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com

This paper presents a novel method, termed Bridge to Answer, to infer correct answers for
questions about a given video by leveraging adequate graph interactions of heterogeneous …

Cited by 93 Related articles All 7 versions

Redundancy-aware transformer for video question answering

Y Li, X Yang, A Zhang, C Feng, X Wang… - Proceedings of the 31st …, 2023 - dl.acm.org

This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically,
the current video encoders tend to holistically embed all video clues at different granularities …

Cited by 5 Related articles All 3 versions

[PDF] arxiv.org

Temporal pyramid transformer with multimodal interaction for video question answering

M Peng, C Wang, Y Gao, Y Shi, XD Zhou - arXiv preprint arXiv:2109.04735, 2021 - arxiv.org

Video question answering (VideoQA) is challenging given its multimodal combination of
visual understanding and natural language understanding. While existing approaches …

Cited by 4 Related articles All 2 versions

[PDF] arxiv.org

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

J Liang, X Meng, Y Wang, C Liu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of
multimedia processing, requiring intricate interactions between visual and textual modalities …
