Improved Visual Grounding through Self-Consistent Explanations

R He, P Cascante-Bonilla, Z Yang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-and-language models trained to match images with text can be combined with visual
explanation methods to point to the locations of specific objects in an image. Our work …

Relation-aware instance refinement for weakly supervised visual grounding

Y Liu, B Wan, L Ma, X He - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
Visual grounding, which aims to build a correspondence between visual objects and their
language entities, plays a key role in cross-modal scene understanding. One promising and …

Improving visual grounding by encouraging consistent gradient-based explanations

Z Yang, K Kafle, F Dernoncourt… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a margin-based loss for tuning joint vision-language models so that their
gradient-based explanations are consistent with region-level annotations provided by …

Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency

J Lee, S Lee, J Nam, S Yu, J Do… - Proceedings of the …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to localize the object in an image referred to by a
natural language expression. Most previous studies learn RIS with a large-scale dataset …

Multi-level multimodal common semantic space for image-phrase grounding

H Akbari, S Karaman, S Bhargava… - Proceedings of the …, 2019 - openaccess.thecvf.com
We address the problem of phrase grounding by learning a multi-level common semantic
space shared by the textual and visual modalities. We exploit multiple levels of feature maps …

Box-based refinement for weakly supervised and unsupervised localization tasks

E Gomel, T Shaharbany, L Wolf - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
It has been established that training a box-based detector network can enhance the
localization performance of weakly supervised and unsupervised methods. Moreover, we …

What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

T Shaharabany, Y Tewel, L Wolf - Advances in Neural …, 2022 - proceedings.neurips.cc
Given an input image, and nothing else, our method returns the bounding boxes of objects
in the image and phrases that describe the objects. This is achieved within an open world …

Similarity maps for self-training weakly-supervised phrase grounding

T Shaharabany, L Wolf - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
A phrase grounding model receives an input image and a text phrase and outputs a suitable
localization map. We present an effective way to refine a phrase grounding model by …

Encoder-decoder based long short-term memory (LSTM) model for video captioning

S Adewale, T Ige, BH Matti - arXiv preprint arXiv:2401.02052, 2023 - arxiv.org
This work demonstrates the implementation and use of an encoder-decoder model to
perform a many-to-many mapping of video data to text captions. The many-to-many mapping …

Weakly supervised moment localization with decoupled consistent concept prediction

F Ma, L Zhu, Y Yang - International Journal of Computer Vision, 2022 - Springer
Localizing moments in a video via natural language queries is a challenging task where
models are trained to identify the start and the end timestamps of the moment in a video …