Related articles - Academic resource search

X-DETR: A versatile architecture for instance-wise vision-language tasks

Z Cai, G Kwon, A Ravichandran, E Bas, Z Tu… - … on Computer Vision, 2022 - Springer

In this paper, we study the challenging instance-wise vision-language tasks, where the free-form
language is required to align with the objects instead of the whole image. To address …

Cited by 43 - Related articles - All 7 versions

[PDF] thecvf.com

VinVL: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com

This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

Cited by 905 - Related articles - All 8 versions

[PDF] arxiv.org

TaskCLIP: Extend large vision-language model for task oriented object detection

H Chen, W Huang, Y Ni, S Yun, F Wen… - arXiv preprint arXiv …, 2024 - arxiv.org

Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
As a challenging task, it requires simultaneous visual data processing and reasoning under …

Cited by 6 - Related articles - All 2 versions

[PDF] thecvf.com

Learning to prompt for open-vocabulary object detection with vision-language model

Y Du, F Wei, Z Zhang, M Shi… - Proceedings of the …, 2022 - openaccess.thecvf.com

Recently, vision-language pre-training shows great potential in open-vocabulary object
detection, where detectors trained on base classes are devised for detecting new classes …

Cited by 242 - Related articles - All 10 versions

[PDF] thecvf.com

YOLO-World: Real-time open-vocabulary object detection

T Cheng, L Song, Y Ge, W Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and trained object …

Cited by 31 - Related articles - All 3 versions

[PDF] arxiv.org

Exploiting unlabeled data with vision and language models for object detection

S Zhao, Z Zhang, S Schulter, L Zhao… - European conference on …, 2022 - Springer

Building robust and generic object detection frameworks requires scaling to larger label
spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations …

Cited by 72 - Related articles - All 8 versions

[PDF] thecvf.com

DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment

L Yao, J Han, X Liang, D Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com

This paper presents DetCLIPv2, an efficient and scalable training framework that
incorporates large-scale image-text pairs to achieve open-vocabulary object detection …

Cited by 44 - Related articles - All 5 versions

[PDF] thecvf.com

Contrastive feature masking open-vocabulary vision transformer

D Kim, A Angelova, W Kuo - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text
pretraining methodology that achieves simultaneous learning of image- and region-level …

Cited by 13 - Related articles - All 5 versions

[PDF] thecvf.com

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …

Cited by 58 - Related articles - All 5 versions

[PDF] thecvf.com

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com

The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

Cited by 34 - Related articles - All 4 versions
