V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

P Wu, S Xie - Proceedings of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
When we look around and perform complex tasks, how we see and selectively process what
we see is crucial. However, the lack of this visual search mechanism in current multimodal …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

No representation rules them all in category discovery

S Vaze, A Vedaldi, A Zisserman - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically,
given a dataset with labelled and unlabelled images, the task is to cluster all images in the …

PACO: Parts and attributes of common objects

V Ramanathan, A Kalia, V Petrovic… - Proceedings of the …, 2023 - openaccess.thecvf.com
Object models are gradually progressing from predicting just category labels to providing
detailed descriptions of object instances. This motivates the need for large datasets which …

MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning

Z Xu, Y Shen, L Huang - arXiv preprint arXiv:2212.10773, 2022 - arxiv.org
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on
tasks specified through instructions, has shown promising zero-shot performance on various …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning the representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …

OvarNet: Towards open-vocabulary object attribute recognition

K Chen, X Jiang, Y Hu, X Tang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we consider the problem of simultaneously detecting objects and inferring their
visual attributes in an image, even for those with no manual annotations provided at the …

Language-only training of zero-shot composed image retrieval

G Gu, S Chun, W Kim, Y Kang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to
retrieve images relevant to both conditions. Conventional CIR approaches need a training …