Spatiotemporal-textual co-attention network for video question answering

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier

Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Cited by 107 | Related articles | All 5 versions

[PDF] neurips.cc

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc

Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

Cited by 227 | Related articles | All 11 versions

[PDF] thecvf.com

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com

Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Cited by 320 | Related articles | All 14 versions

[PDF] thecvf.com

Clover: Towards a unified video-language alignment and fusion model

J Huang, Y Li, J Feng, X Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Building a universal video-language model for solving various video understanding tasks
(e.g., text-video retrieval, video question answering) is an open challenge to the machine …

Cited by 62 | Related articles | All 5 versions

[PDF] thecvf.com

Tem-adapter: Adapting image-text pretraining for video question answer

G Chen, X Liu, G Wang, K Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Video-language pre-trained models have shown remarkable success in guiding video
question-answering (VideoQA) tasks. However, due to the length of video sequences …

Cited by 16 | Related articles | All 6 versions

[PDF] arxiv.org

Occluded prohibited items detection: An x-ray security inspection benchmark and de-occlusion attention module

Y Wei, R Tao, Z Wu, Y Ma, L Zhang, X Liu - Proceedings of the 28th ACM …, 2020 - dl.acm.org

Security inspection often deals with a piece of baggage or suitcase where objects are
heavily overlapped with each other, resulting in an unsatisfactory performance for prohibited …

Cited by 216 | Related articles | All 4 versions

[PDF] thecvf.com

Structured multi-level interaction network for video moment localization via language query

H Wang, ZJ Zha, L Li, D Liu… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We address the problem of localizing a specific moment described by a natural language
query. Existing works interact the query with either video frame or moment proposal, and …

Cited by 99 | Related articles | All 4 versions

Hierarchical representation network with auxiliary tasks for video captioning and video question answering

L Gao, Y Lei, P Zeng, J Song, M Wang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Recently, integrating vision and language for in-depth video understanding, e.g., video
captioning and video question answering, has become a promising direction for artificial …

Cited by 76 | Related articles | All 4 versions

[PDF] neurips.cc

Learning from inside: Self-driven siamese sampling and reasoning for video question answering

W Yu, H Zheng, M Li, L Ji, L Wu… - Advances in Neural …, 2021 - proceedings.neurips.cc

Recent advances in the video question answering (i.e., VideoQA) task have achieved strong
success by following the paradigm of fine-tuning each clip-text pair independently on the …

Cited by 46 | Related articles | All 7 versions

[PDF] arxiv.org

Learning to answer visual questions from web videos

A Yang, A Miech, J Sivic, I Laptev, C Schmid - arXiv preprint arXiv …, 2022 - arxiv.org

Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Cited by 43 | Related articles | All 10 versions
