Zero-shot composed image retrieval with textual inversion

A Baldrati, L Agnolucci, M Bertini… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query
composed of a reference image and a relative caption that describes the difference between …

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

Fashion IQ: A new dataset towards retrieving images by natural language feedback

H Wu, Y Gao, X Guo, Z Al-Halah… - Proceedings of the …, 2021 - openaccess.thecvf.com
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user friendly than classical keyword-based search interfaces. In this paper …

Can language models encode perceptual structure without grounding? A case study in color

M Abdou, A Kulmizev, D Hershcovich, S Frank… - arXiv preprint arXiv …, 2021 - arxiv.org
Pretrained language models have been shown to encode relational information, such as the
relations between entities or concepts in knowledge bases, e.g., (Paris, Capital, France) …

Image retrieval from contextual descriptions

B Krojer, V Adlakha, V Vineet, Y Goyal, E Ponti… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …

Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

X Hu, L Gu, Q An, M Zhang, L Liu, K Kobayashi… - Proceedings of the 29th …, 2023 - dl.acm.org
To contribute to automating the medical vision-language model, we propose a novel Chest-
Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference …

Modality-agnostic attention fusion for visual search with text feedback

E Dodds, J Culpepper, S Herdade, Y Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org
Image retrieval with natural language feedback offers the promise of catalog search based
on fine-grained visual features that go beyond objects and binary attributes, facilitating real …

Step Differences in Instructional Video

T Nagarajan, L Torresani - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Comparing a user video to a reference how-to video is a key requirement for AR/VR
technology delivering personalized assistance tailored to the user's progress. However …

SAC: Semantic attention composition for text-conditioned image retrieval

S Jandial, P Badjatiya, P Chawla… - Proceedings of the …, 2022 - openaccess.thecvf.com
The ability to efficiently search for images is essential for improving the user experiences
across various products. Incorporating user feedback, via multi-modal inputs, to navigate …