The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the fields of computer vision and natural language processing that requires a computer to output …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced the performance on many vision-language tasks. However, most existing pre-trained models only excel in either …

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious …

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Mainstream Video-Language Pre-training models consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better …

Less is more: ClipBERT for video-and-language learning via sparse sampling

J Lei, L Li, L Zhou, Z Gan, TL Berg… - Proceedings of the …, 2021 - openaccess.thecvf.com
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision …

Align and prompt: Video-and-language pre-training with entity prompts

D Li, J Li, H Li, JC Niebles… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a …

VIOLET: End-to-end video-language transformers with masked visual-token modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
Research in multimodal learning has progressed rapidly over the last decade in several areas, especially computer vision. The growing potential of multimodal data …

Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignments between visual scenes in video and …

Learning memory-guided normality for anomaly detection

H Park, J Noh, B Ham - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
We address the problem of anomaly detection, that is, detecting anomalous events in a
video sequence. Anomaly detection methods based on convolutional neural networks …