S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural …
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …
The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes …
Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the …
Y Ji, J Wang, Y Gong, L Zhang, Y Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …
C Lei, S Luo, Y Liu, W He, J Wang, G Wang… - Proceedings of the 29th …, 2021 - dl.acm.org
The pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, it is still very challenging to pre-train neural …
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining …
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based …
We introduce LAVILA, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …