V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

P Wu, S Xie - Proceedings of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
When we look around and perform complex tasks, how we see and selectively process what
we see is crucial. However, the lack of this visual search mechanism in current multimodal …

Physically grounded vision-language models for robotic manipulation

J Gao, B Sarkar, F Xia, T Xiao, J Wu… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Recent advances in vision-language models (VLMs) have led to improved performance on
tasks such as visual question answering and image captioning. Consequently, these models …

No representation rules them all in category discovery

S Vaze, A Vedaldi, A Zisserman - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically,
given a dataset with labelled and unlabelled images, the task is to cluster all images in the …

PACO: Parts and attributes of common objects

V Ramanathan, A Kalia, V Petrovic… - Proceedings of the …, 2023 - openaccess.thecvf.com
Object models are gradually progressing from predicting just category labels to providing
detailed descriptions of object instances. This motivates the need for large datasets which …

MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning

Z Xu, Y Shen, L Huang - arXiv preprint arXiv:2212.10773, 2022 - arxiv.org
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on
tasks specified through instructions, has shown promising zero-shot performance on various …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning the representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …

OvarNet: Towards open-vocabulary object attribute recognition

K Chen, X Jiang, Y Hu, X Tang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we consider the problem of simultaneously detecting objects and inferring their
visual attributes in an image, even for those with no manual annotations provided at the …

Language-only training of zero-shot composed image retrieval

G Gu, S Chun, W Kim, Y Kang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to
retrieve images relevant to both conditions. Conventional CIR approaches need a training …