Improved Visual Grounding through Self-Consistent Explanations

R He, P Cascante-Bonilla, Z Yang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-and-language models trained to match images with text can be combined with visual
explanation methods to point to the locations of specific objects in an image. Our work …

Relation-aware instance refinement for weakly supervised visual grounding

Y Liu, B Wan, L Ma, X He - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
Visual grounding, which aims to build a correspondence between visual objects and their
language entities, plays a key role in cross-modal scene understanding. One promising and …

Improving visual grounding by encouraging consistent gradient-based explanations

Z Yang, K Kafle, F Dernoncourt… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a margin-based loss for tuning joint vision-language models so that their
gradient-based explanations are consistent with region-level annotations provided by …

Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency

J Lee, S Lee, J Nam, S Yu, J Do… - Proceedings of the …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to localize the object in an image referred to by a
natural language expression. Most previous studies learn RIS with a large-scale dataset …

Multi-level multimodal common semantic space for image-phrase grounding

H Akbari, S Karaman, S Bhargava… - Proceedings of the …, 2019 - openaccess.thecvf.com
We address the problem of phrase grounding by learning a multi-level common semantic
space shared by the textual and visual modalities. We exploit multiple levels of feature maps …

Box-based refinement for weakly supervised and unsupervised localization tasks

E Gomel, T Shaharbany, L Wolf - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
It has been established that training a box-based detector network can enhance the
localization performance of weakly supervised and unsupervised methods. Moreover, we …

What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

T Shaharabany, Y Tewel, L Wolf - Advances in Neural …, 2022 - proceedings.neurips.cc
Given an input image, and nothing else, our method returns the bounding boxes of objects
in the image and phrases that describe the objects. This is achieved within an open world …

Similarity maps for self-training weakly-supervised phrase grounding

T Shaharabany, L Wolf - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
A phrase grounding model receives an input image and a text phrase and outputs a suitable
localization map. We present an effective way to refine a phrase grounding model by …

Encoder-decoder based long short-term memory (LSTM) model for video captioning

S Adewale, T Ige, BH Matti - arXiv preprint arXiv:2401.02052, 2023 - arxiv.org
This work demonstrates the implementation and use of an encoder-decoder model to
perform a many-to-many mapping of video data to text captions. The many-to-many mapping …

Weakly supervised moment localization with decoupled consistent concept prediction

F Ma, L Zhu, Y Yang - International Journal of Computer Vision, 2022 - Springer
Localizing moments in a video via natural language queries is a challenging task where
models are trained to identify the start and the end timestamps of the moment in a video …