Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks

A survey on graph neural networks and graph transformers in computer vision: a task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu, S Yang… - arXiv preprint arXiv …, 2022 - arxiv.org

Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social …

Cited by 45 · Related articles · All 3 versions

[PDF] thecvf.com

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com

Abstract Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …

Cited by 87 · Related articles · All 6 versions

[PDF] thecvf.com

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Abstract Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …

Cited by 67 · Related articles · All 7 versions

[PDF] thecvf.com

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …

Cited by 281 · Related articles · All 6 versions

[PDF] thecvf.com

Referring multi-object tracking

D Wu, W Han, T Wang, X Dong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing referring understanding tasks tend to involve the detection of a single text-referred
object. In this paper, we propose a new and general referring understanding task, termed …

Cited by 49 · Related articles · All 5 versions

[PDF] arxiv.org

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …

Cited by 93 · Related articles · All 7 versions

[PDF] aaai.org

Similarity reasoning and filtration for image-text matching

H Diao, Y Zhang, L Ma, H Lu - Proceedings of the AAAI conference on …, 2021 - ojs.aaai.org

Image-text matching plays a critical role in bridging the vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …

Cited by 285 · Related articles · All 9 versions

[PDF] mdpi.com

Graph representation learning and its applications: a survey

VT Hoang, HJ Jeon, ES You, Y Yoon, S Jung, OJ Lee - Sensors, 2023 - mdpi.com

Graphs are data structures that effectively represent relational data in the real world. Graph
representation learning is a significant task since it could facilitate various downstream …

Cited by 20 · Related articles · All 9 versions

[PDF] thecvf.com

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com

Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

Cited by 87 · Related articles · All 7 versions

[PDF] thecvf.com

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

Cited by 82 · Related articles · All 10 versions
