W Shen, Z Peng, X Wang, H Wang… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
The rapid development of deep learning has made a great progress in image segmentation, one of the fundamental tasks of computer vision. However, the current segmentation …
We present ImageBind, an approach to learn a joint embedding across six different modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …
In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary …
Humans describe the physical world using natural language to refer to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or …
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse, high-quality images. However, directly applying these models for real …
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model …
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task …