We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image. Existing methods have only been …
Recent advances in Multimodal Large Language Models (MLLMs) have utilized Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …
Conversational interfaces for the detail-oriented retail fashion domain are more natural, expressive, and user-friendly than classical keyword-based search interfaces. In this paper …
Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge bases, e.g., (Paris, Capital, France) …
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. To measure the extent to which current …
To contribute to automating medical vision-language models, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference …
Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real …
T Nagarajan, L Torresani - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However …
The ability to efficiently search for images is essential for improving the user experience across various products. Incorporating user feedback, via multimodal inputs, to navigate …