We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired …
J Huang, Y Li, J Feng, X Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine …
P Bagad, M Tapaswi… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful …
H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video …
With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models …
H Xu, G Ghosh, PY Huang, P Arora… - arXiv preprint arXiv …, 2021 - arxiv.org
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task …
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for …
Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …