Recent breakthroughs in natural language processing, driven by pretraining models on large quantities of data, have opened the way for similar foundation models in computer vision …
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
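The core mechanism this snippet points at is aligning every modality to images with a contrastive objective. A minimal, hypothetical PyTorch sketch of such a symmetric InfoNCE loss is shown below; the function name and temperature are assumptions, not ImageBind's actual code:

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between image embeddings and another modality's
    embeddings, both of shape [batch, dim]; paired samples share an index."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

Applying this loss between images and each of the other five modalities in turn is enough to place all six in one embedding space, since images act as the shared anchor.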
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using …
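For readers unfamiliar with the paradigm being revisited, the standard recipe looks roughly like the sketch below, here using an off-the-shelf torchvision backbone as a stand-in for a foundation model; the class count and learning rate are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (stand-in for a foundation model).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Swap the classification head for the downstream task.
num_classes = 10  # hypothetical downstream label count
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Finetune all weights at a small learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```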
Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
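In the frozen-ViT setting, the pretrained backbone receives no gradient updates and only a small head is trained. A minimal sketch under the assumption of a torchvision ViT with a linear probe (the class count is hypothetical):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained ViT and freeze every backbone parameter.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
for p in vit.parameters():
    p.requires_grad_(False)

# Only this freshly initialized head is trained ("frozen ViT" probing).
vit.heads = nn.Linear(vit.hidden_dim, 10)  # 10 is a hypothetical class count
optimizer = torch.optim.AdamW(vit.heads.parameters(), lr=1e-3)
```

Contrasted with the full-finetuning sketch above, this trains orders of magnitude fewer parameters, which is what makes frozen backbones attractive when labels are scarce.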
Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning …
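Masked visual modeling for video commonly uses "tube" masking, where the same random spatial patches are dropped in every frame so the model cannot copy content across time. The helper below is a hypothetical sketch of that idea, not code from any of these papers:

```python
import torch

def tube_mask(tokens, mask_ratio=0.9):
    """Tube masking: drop the same random spatial patches in every frame.
    tokens: [batch, frames, patches, dim]; returns only the visible tokens."""
    b, t, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    # Independent random scores per sample; keep the top-scoring patches.
    scores = torch.rand(b, n, device=tokens.device)
    keep = scores.argsort(dim=-1)[:, :num_keep]            # [b, num_keep]
    keep = keep[:, None, :, None].expand(b, t, num_keep, d)
    return torch.gather(tokens, dim=2, index=keep)
```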
In this paper, we first extend the recent Masked Autoencoder (MAE) model from a single modality to the audio-visual multimodal setting. Subsequently, we propose the Contrastive Audio …
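A rough sketch of how a masked-reconstruction objective can be combined with a contrastive audio-visual term follows; the function name, loss weighting, and interface are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def cav_mae_loss(pred_a, target_a, pred_v, target_v, emb_a, emb_v,
                 temperature=0.07, contrast_weight=0.01):
    """Hypothetical combined objective: MAE reconstruction on the masked
    audio and video patches plus a contrastive term that pulls paired
    audio/video clip embeddings together."""
    recon = F.mse_loss(pred_a, target_a) + F.mse_loss(pred_v, target_v)
    a = F.normalize(emb_a, dim=-1)
    v = F.normalize(emb_v, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrast = F.cross_entropy(logits, targets)
    return recon + contrast_weight * contrast
```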
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of …
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transfer learning, while failing to …
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining …