Zero-shot composed image retrieval with textual inversion

A Baldrati, L Agnolucci, M Bertini… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query
composed of a reference image and a relative caption that describes the difference between …

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

Fashion IQ: A new dataset towards retrieving images by natural language feedback

H Wu, Y Gao, X Guo, Z Al-Halah… - Proceedings of the …, 2021 - openaccess.thecvf.com
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user friendly than classical keyword-based search interfaces. In this paper …

Can language models encode perceptual structure without grounding? A case study in color

M Abdou, A Kulmizev, D Hershcovich, S Frank… - arXiv preprint arXiv …, 2021 - arxiv.org
Pretrained language models have been shown to encode relational information, such as the
relations between entities or concepts in knowledge bases, e.g., (Paris, Capital, France) …

Image retrieval from contextual descriptions

B Krojer, V Adlakha, V Vineet, Y Goyal, E Ponti… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …

Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

X Hu, L Gu, Q An, M Zhang, L Liu, K Kobayashi… - Proceedings of the 29th …, 2023 - dl.acm.org
To contribute to automating the medical vision-language model, we propose a novel Chest-
Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference …

Modality-agnostic attention fusion for visual search with text feedback

E Dodds, J Culpepper, S Herdade, Y Zhang… - arXiv preprint arXiv …, 2020 - arxiv.org
Image retrieval with natural language feedback offers the promise of catalog search based
on fine-grained visual features that go beyond objects and binary attributes, facilitating real …

Step Differences in Instructional Video

T Nagarajan, L Torresani - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Comparing a user video to a reference how-to video is a key requirement for AR/VR
technology delivering personalized assistance tailored to the user's progress. However …

SAC: Semantic attention composition for text-conditioned image retrieval

S Jandial, P Badjatiya, P Chawla… - Proceedings of the …, 2022 - openaccess.thecvf.com
The ability to efficiently search for images is essential for improving the user experiences
across various products. Incorporating user feedback, via multi-modal inputs, to navigate …