Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
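
The title names the key mechanism: instead of an end-to-end network, a language model writes a short Python program that composes pretrained vision modules, and executing that program yields the answer. Below is a minimal, assumed sketch of that paradigm; the `ImagePatch` class, its `find`/`simple_query` methods, and `execute_query` are simplified stand-ins for illustration, not the paper's released API.

```python
# Sketch of execution-based visual reasoning (simplified, assumed interface).

class ImagePatch:
    """Wrapper around an image region. In a real system, the methods below
    would be backed by pretrained vision models."""

    def __init__(self, image, box=None):
        self.image = image
        self.box = box  # (x1, y1, x2, y2), or None for the full image

    def find(self, object_name: str) -> list["ImagePatch"]:
        # Assumption: an open-vocabulary detector returns matching regions.
        raise NotImplementedError("plug in an object detector")

    def simple_query(self, question: str) -> str:
        # Assumption: a VQA/captioning model answers about this patch.
        raise NotImplementedError("plug in a VQA model")


# The kind of program a code-generating LLM might emit for the query
# "What color is the mug to the left of the laptop?":
def execute_query(image) -> str:
    scene = ImagePatch(image)
    laptops = scene.find("laptop")
    mugs = scene.find("mug")
    if not laptops or not mugs:
        return "unanswerable"
    laptop_left_edge = laptops[0].box[0]
    left_mugs = [m for m in mugs if m.box[2] < laptop_left_edge]
    if not left_mugs:
        return "no mug to the left of the laptop"
    return left_mugs[0].simple_query("What color is this?")
```

Because the reasoning lives in ordinary Python control flow, each intermediate step (which objects were found, which spatial test fired) is inspectable, which is the explicitness that the snippet's contrast with end-to-end models hints at.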

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
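
The "self-chained" design in the title suggests a two-stage use of a single image-language model: first localize question-relevant keyframes, then answer from only those frames. A rough sketch of that chaining, where `model.score_relevance` and `model.answer` are hypothetical methods standing in for the underlying image-language model:

```python
# Two-stage "localize, then answer" sketch (hypothetical model interface).

def answer_video_question(frames, question, model, k=4):
    # Stage 1 (localizer): score each sampled frame's relevance to the question.
    scores = [model.score_relevance(frame, question) for frame in frames]
    # Keep the k highest-scoring frames as keyframes.
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    keyframes = [frames[i] for i in top]
    # Stage 2 (answerer): answer using only the selected keyframes.
    return model.answer(keyframes, question)
```

The payoff is efficiency: the expensive answering pass sees k frames rather than the whole video, which is how an image-level model can be stretched to video input.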

VIMA: General robot manipulation with multimodal prompts

Y Jiang, A Gupta, Z Zhang, G Wang… - arXiv preprint …, 2022 - authors.library.caltech.edu
Prompt-based learning has emerged as a successful paradigm in natural language
processing, where a single general-purpose language model can be instructed to perform …
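
Here "multimodal prompts" means instructions in which images are inline parts of the prompt itself (e.g., a reference scene to rearrange toward), not merely observations. As an assumed data-structure sketch, not VIMA's actual interface, such a prompt can be modeled as an ordered interleaving of typed segments:

```python
# Assumed sketch: a multimodal prompt as interleaved text and image segments.
from dataclasses import dataclass
from typing import List, Union

import numpy as np


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    pixels: np.ndarray  # e.g. an HxWx3 reference frame


MultimodalPrompt = List[Union[TextSegment, ImageSegment]]

# Example: the image is part of the instruction, not an observation.
prompt: MultimodalPrompt = [
    TextSegment("Rearrange the objects on the table to match this scene:"),
    ImageSegment(np.zeros((64, 64, 3), dtype=np.uint8)),  # placeholder image
]

# A policy would embed each segment and condition action prediction on the
# resulting token sequence, so one model can serve many task specifications.
```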

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …
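
A common recipe for the verb-understanding problem the title raises, and one consistent with this line of work, is to train against "verb-swapped" hard negatives so that video-text matching cannot succeed on nouns alone. The sketch below is purely illustrative; the tiny swap table and `verb_hard_negative` helper are toy stand-ins, not the paper's construction.

```python
# Toy illustration of verb-swapped hard-negative captions for contrastive training.
from typing import Optional

VERB_SWAPS = {"opening": "closing", "entering": "leaving", "pushing": "pulling"}

def verb_hard_negative(caption: str) -> Optional[str]:
    """Return the caption with its verb replaced by a plausible opposite,
    or None if no swappable verb is present."""
    for verb, opposite in VERB_SWAPS.items():
        if verb in caption:
            return caption.replace(verb, opposite)
    return None

assert verb_hard_negative("a person opening a door") == "a person closing a door"
```

Pairing each positive caption with such a negative forces the model to attend to the action itself rather than the co-occurring objects.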

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance; these methods pursue semantic interaction upon pre …
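
The baseline this entry starts from, CLIP-style contrastive learning, treats the batch's matched video-text pairs as positives and every mismatched pairing as a negative. The snippet is truncated before the hierarchical Banzhaf-interaction refinement, so the sketch below shows only that standard symmetric InfoNCE objective, in plain NumPy.

```python
import numpy as np

def clip_style_loss(video_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over N aligned (video, text) pairs.
    Inputs are (N, D) arrays; row i of each is a matched pair."""
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N): entry (i, j) scores pair (i, j)

    def xent(l: np.ndarray) -> float:
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage: a batch of 4 pairs with 8-dimensional embeddings.
rng = np.random.default_rng(0)
loss = clip_style_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```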

VindLU: A recipe for effective video-and-language pretraining

F Cheng, X Wang, J Lei, D Crandall… - Proceedings of the …, 2023 - openaccess.thecvf.com
The last several years have witnessed remarkable progress in video-and-language (VidL)
understanding. However, most modern VidL approaches use complex and specialized …

Cross-modal causal relational reasoning for event-level visual question answering

Y Liu, G Li, L Lin - IEEE Transactions on Pattern Analysis and …, 2023 - ieeexplore.ieee.org
Existing visual question answering methods often suffer from cross-modal spurious
correlations and oversimplified event-level reasoning processes that fail to capture event …