As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired …
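Read literally, each example in this task pairs a video-plus-subtitles premise with a natural-language hypothesis to verify against it. Below is a minimal sketch of that record layout, assuming a binary entailed/contradicted label; every field name is illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class VideoLanguageInferenceExample:
    """One premise/hypothesis pair for video-and-language inference."""
    video_path: str        # premise, part 1: the video clip
    subtitles: list[str]   # premise, part 2: subtitle lines aligned to the clip
    hypothesis: str        # natural-language statement to check against the premise
    label: str             # assumed binary: "entailed" or "contradicted"
```

A model for this task would consume the two premise fields together with the hypothesis and predict the label.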
P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal …
J Li, L Niu, L Zhang - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Video understanding has achieved great success in representation learning for tasks such as video captioning, video object grounding, and descriptive video question answering. However, current …
Recently, several multimodal models have been developed for joint image and language understanding; these have demonstrated impressive chat abilities by utilizing advanced …
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …
What makes a task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we …
We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge …
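KM-BART itself augments BART with visual features and commonsense-aware pretraining, none of which is shown here; as a rough, text-only sketch of the underlying sequence-to-sequence interface it builds on (assuming the HuggingFace transformers library and the facebook/bart-base checkpoint):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Plain BART encoder-decoder; KM-BART would additionally feed visual
# features into the encoder alongside the text tokens.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Encode a prompt and generate a continuation, sequence-to-sequence style.
inputs = tokenizer("a person opens an umbrella in the rain", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```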
Unified vision-language frameworks have advanced greatly in recent years; most adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …