Self-supervised representation learning (SSRL) methods aim to provide powerful, deep feature learning without the requirement of large annotated data sets, thus alleviating the …
We present ImageBind, an approach to learn a joint embedding across six different modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video …
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce …
Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually …
Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video …
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human …
We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based …