The core of video understanding tasks such as recognition, captioning, and tracking is to automatically detect objects or actions in a video and analyze their temporal evolution …
M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many …
DW Lee, C Ahuja, PP Liang, S Natu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Many educational videos use slide presentations, a sequence of visual pages that contain text and figures accompanied by spoken language, which are constructed and presented …
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
Vision-language pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing …
C Liang, W Wang, T Zhou… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Abductive reasoning seeks the likeliest possible explanation for partial observations. Although abduction is frequently employed in human daily reasoning, it is rarely explored in …
In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and …
L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze …
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …