Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

Violin: A large-scale dataset for video-and-language inference

J Liu, W Chen, Y Cheng, Z Gan, L Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com
We introduce a new task, Video-and-Language Inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as premise, paired …

Clover: Towards a unified video-language alignment and fusion model

J Huang, Y Li, J Feng, X Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a universal video-language model for solving various video understanding tasks
(e.g., text-video retrieval, video question answering) is an open challenge to the machine …

Test of time: Instilling video-language models with a sense of time

P Bagad, M Tapaswi… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Modelling and understanding time remains a challenge in contemporary video
understanding models. With language emerging as a key driver towards powerful …

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

VideoLLM: Modeling video sequence with large language models

G Chen, YD Zheng, J Wang, J Xu, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential growth of video data, there is an urgent need for automated technology
to analyze and comprehend video content. However, existing video understanding models …

VLM: Task-agnostic video-language model pre-training for video understanding

H Xu, G Ghosh, PY Huang, P Arora… - arXiv preprint arXiv …, 2021 - arxiv.org
We present a simplified, task-agnostic multi-modal pre-training approach that can accept
either video or text input, or both for a variety of end tasks. Existing pre-training are task …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

Temporal perceiving video-language pre-training

F Ma, X Jin, H Wang, J Huang, L Zhu, J Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
Video-Language Pre-training models have recently significantly improved various multi-
modal downstream tasks. Previous dominant works mainly adopt contrastive learning to …

LAVENDER: Unifying video-language understanding as masked language modeling

L Li, Z Gan, K Lin, CC Lin, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Unified vision-language frameworks have greatly advanced in recent years, most of which
adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …