Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Clover: Towards a unified video-language alignment and fusion model

J Huang, Y Li, J Feng, X Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a universal video-language model for solving various video understanding tasks
(e.g., text-video retrieval, video question answering) is an open challenge to the machine …

Tem-adapter: Adapting image-text pretraining for video question answer

G Chen, X Liu, G Wang, K Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-trained models have shown remarkable success in guiding video
question-answering (VideoQA) tasks. However, due to the length of video sequences …

Occluded prohibited items detection: An x-ray security inspection benchmark and de-occlusion attention module

Y Wei, R Tao, Z Wu, Y Ma, L Zhang, X Liu - Proceedings of the 28th ACM …, 2020 - dl.acm.org
Security inspection often deals with a piece of baggage or suitcase where objects are
heavily overlapped with each other, resulting in an unsatisfactory performance for prohibited …

Structured multi-level interaction network for video moment localization via language query

H Wang, ZJ Zha, L Li, D Liu… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the problem of localizing a specific moment described by a natural language
query. Existing works interact the query with either video frame or moment proposal, and …

Hierarchical representation network with auxiliary tasks for video captioning and video question answering

L Gao, Y Lei, P Zeng, J Song, M Wang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Recently, integrating vision and language for in-depth video understanding, e.g., video
captioning and video question answering, has become a promising direction for artificial …

Learning from inside: Self-driven siamese sampling and reasoning for video question answering

W Yu, H Zheng, M Li, L Ji, L Wu… - Advances in Neural …, 2021 - proceedings.neurips.cc
Recent advances in the video question answering (i.e., VideoQA) task have achieved strong
success by following the paradigm of fine-tuning each clip-text pair independently on the …

Learning to answer visual questions from web videos

A Yang, A Miech, J Sivic, I Laptev, C Schmid - arXiv preprint arXiv …, 2022 - arxiv.org
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …