X-DETR: A versatile architecture for instance-wise vision-language tasks

Z Cai, G Kwon, A Ravichandran, E Bas, Z Tu… - … on Computer Vision, 2022 - Springer
In this paper, we study the challenging instance-wise vision-language tasks, where the free-
form language is required to align with the objects instead of the whole image. To address …

VinVL: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

TaskCLIP: Extend large vision-language model for task oriented object detection

H Chen, W Huang, Y Ni, S Yun, F Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
As a challenging task, it requires simultaneous visual data processing and reasoning under …

Learning to prompt for open-vocabulary object detection with vision-language model

Y Du, F Wei, Z Zhang, M Shi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recently, vision-language pre-training shows great potential in open-vocabulary object
detection, where detectors trained on base classes are devised for detecting new classes …

YOLO-World: Real-time open-vocabulary object detection

T Cheng, L Song, Y Ge, W Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and trained object …

Exploiting unlabeled data with vision and language models for object detection

S Zhao, Z Zhang, S Schulter, L Zhao… - European conference on …, 2022 - Springer
Building robust and generic object detection frameworks requires scaling to larger label
spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations …

DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment

L Yao, J Han, X Liang, D Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents DetCLIPv2, an efficient and scalable training framework that
incorporates large-scale image-text pairs to achieve open-vocabulary object detection …

Contrastive feature masking open-vocabulary vision transformer

D Kim, A Angelova, W Kuo - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text
pretraining methodology that achieves simultaneous learning of image- and region-level …

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …