Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods for learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

ACAV100M: Automatic curation of large-scale datasets for audio-visual video representation learning

S Lee, J Chung, Y Yu, G Kim, T Breuel… - Proceedings of the …, 2021 - openaccess.thecvf.com
The natural association between visual observations and their corresponding sound
provides powerful self-supervisory signals for learning video representations, which makes …

Masking modalities for cross-modal video retrieval

V Gabeur, A Nagrani, C Sun… - Proceedings of the …, 2022 - openaccess.thecvf.com
Pre-training on large-scale unlabelled datasets has shown impressive performance
improvements in the fields of computer vision and natural language processing. Given the …

MAP: Multimodal uncertainty-aware vision-language pre-training model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal semantic understanding often has to deal with uncertainty, meaning that the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

Understanding Chinese video and language via contrastive multimodal pre-training

C Lei, S Luo, Y Liu, W He, J Wang, G Wang… - Proceedings of the 29th …, 2021 - dl.acm.org
Pre-trained neural models have recently achieved impressive performance in
understanding multimodal content. However, it is still very challenging to pre-train neural …

Audiovisual masked autoencoders

MI Georgescu, E Fonseca, RT Ionescu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Can we leverage the audiovisual information already present in video to improve self-
supervised representation learning? To answer this question, we study various pretraining …

OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce LaViLa, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …