Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
Most visual recognition studies rely heavily on crowd-labelled data to train deep neural networks (DNNs), and they usually train a separate DNN for each individual visual recognition task …
Neural compression is the application of neural networks and other machine learning methods to data compression. Recent advances in statistical machine learning have opened …
Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific …
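Prompt learning of the kind described above replaces hand-written templates with a small set of learnable context vectors prepended to the frozen class-name embedding (the CoOp-style setup). The following is a minimal numpy sketch under toy assumptions: the "text encoder" is just mean pooling, the frozen `class_emb` and target `image_emb` are random stand-ins, and the analytic gradient descent optimizes only the context vectors, mirroring the fact that the foundation model stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_CTX = 8, 4                       # embedding dim, number of context tokens

# Hypothetical frozen class-name embedding and target image embedding.
class_emb = rng.normal(size=(1, D))
image_emb = rng.normal(size=(D,))

# Learnable context vectors: the only trainable parameters.
ctx = rng.normal(size=(N_CTX, D)) * 0.02

def prompt_features(ctx):
    """Mean-pool the [context; class] token sequence into one text feature."""
    return np.concatenate([ctx, class_emb], axis=0).mean(axis=0)

lr = 0.5
for _ in range(200):
    f = prompt_features(ctx)
    err = f - image_emb               # grad of 0.5*||f - img||^2 w.r.t. f
    # Each context row contributes 1/(N_CTX + 1) to the pooled feature.
    ctx -= lr * err / (N_CTX + 1)

final_loss = 0.5 * np.sum((prompt_features(ctx) - image_emb) ** 2)
```

Real prompt learners optimize a contrastive objective through CLIP's transformer text encoder; the point here is only the parameterization, where gradients flow into the prompt tokens while everything else is frozen.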
Z Wang, Y Li, X Chen, SN Lim… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited …
Leveraging the extensive training data from SA-1B, the Segment Anything Model (SAM) demonstrates remarkable generalization and zero-shot capabilities. However, as a category …
D Kim, A Angelova, W Kuo - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining …
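The contrastive image-text pretraining such recipes build on is, at its core, a symmetric InfoNCE loss over a batch of paired embeddings: matched image-text pairs sit on the diagonal of a similarity matrix, and each row and column is treated as a classification problem over the batch. A minimal numpy sketch (random features as stand-ins for real encoder outputs; the temperature value is illustrative):

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # (B, B); matched pairs on diagonal
    loss_i2t = -np.diag(log_softmax(logits, axis=1)).mean()  # image -> text
    loss_t2i = -np.diag(log_softmax(logits, axis=0)).mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
aligned = clip_contrastive_loss(feats, feats)        # perfectly matched pairs
mismatched = clip_contrastive_loss(feats, rng.normal(size=(4, 16)))
```

Perfectly matched pairs drive the loss toward zero while independent random pairs do not, which is the signal region-aware variants then adapt to operate at the region rather than whole-image level.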
Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods …
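A common way to derive such region-word alignment is a soft assignment: compute cosine similarities between region proposal features and caption word embeddings, then softmax over regions so each word selects a mixture of candidate regions. A toy numpy sketch (random features stand in for detector and text-encoder outputs; the temperature `tau` is an assumed hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
R, W, D = 5, 3, 16                    # regions, words, feature dim

regions = rng.normal(size=(R, D))     # hypothetical region proposal features
words = rng.normal(size=(W, D))       # hypothetical word embeddings

regions /= np.linalg.norm(regions, axis=1, keepdims=True)
words /= np.linalg.norm(words, axis=1, keepdims=True)

sim = regions @ words.T               # (R, W) cosine similarity matrix

# Soft alignment: for each word, a distribution over candidate regions.
tau = 0.1
align = np.exp(sim / tau)
align /= align.sum(axis=0, keepdims=True)

# Image-level score per word: similarity under its aligned region mixture.
word_scores = (align * sim).sum(axis=0)
```

The resulting per-word region distribution is what alignment-based detectors supervise or distill from, turning weak image-text pairs into object-level signal.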
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However …