A survey on graph neural networks and graph transformers in computer vision: a task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu, S Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social …

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …
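
The snippet above mentions zero-shot classification as one application of CLIP, and the paper's title refers to marking a region with a red circle as a visual prompt. The block below is a minimal, hedged sketch of that idea, assuming the Hugging Face transformers CLIP API and Pillow; the image path, region box, and candidate captions are placeholders, not details taken from the paper.

```python
# Sketch: score candidate captions for a region marked with a red circle,
# the "visual prompt" idea named in the title. Inputs below are hypothetical.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def circle_prompt(image: Image.Image, box, width=4):
    """Return a copy of `image` with a red ellipse drawn around `box` = (x0, y0, x1, y1)."""
    marked = image.copy()
    ImageDraw.Draw(marked).ellipse(box, outline="red", width=width)
    return marked

image = Image.open("street.jpg").convert("RGB")        # placeholder image path
marked = circle_prompt(image, box=(40, 60, 180, 220))  # placeholder region

texts = ["a photo of a person", "a photo of a car"]    # candidate zero-shot labels
inputs = processor(text=texts, images=marked, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```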

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …

Referring multi-object tracking

D Wu, W Han, T Wang, X Dong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing referring understanding tasks tend to involve the detection of a single text-referred
object. In this paper, we propose a new and general referring understanding task, termed …

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …

Similarity reasoning and filtration for image-text matching

H Diao, Y Zhang, L Ma, H Lu - Proceedings of the AAAI conference on …, 2021 - ojs.aaai.org
Image-text matching plays a critical role in bridging vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …
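
The snippet above refers to global alignment between images and sentences. The sketch below is a generic illustration of that baseline, not the reasoning and filtration modules proposed in the paper: it scores every image against every sentence with a cosine-similarity matrix over precomputed embeddings (random placeholders here; in practice each modality comes from its own encoder).

```python
# Sketch: global image-sentence alignment as a cosine-similarity matrix.
# Embeddings are random placeholders standing in for encoder outputs.
import torch
import torch.nn.functional as F

def global_alignment_scores(image_emb, text_emb):
    """image_emb: (num_images, d), text_emb: (num_sentences, d) -> (num_images, num_sentences)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.t()

scores = global_alignment_scores(torch.randn(5, 256), torch.randn(7, 256))
print(scores.argmax(dim=1))  # index of the best-matching sentence for each image
```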

Graph representation learning and its applications: a survey

VT Hoang, HJ Jeon, ES You, Y Yoon, S Jung, OJ Lee - Sensors, 2023 - mdpi.com
Graphs are data structures that effectively represent relational data in the real world. Graph
representation learning is a significant task since it could facilitate various downstream …
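
Since this entry surveys graph representation learning broadly, the sketch below shows one standard building block rather than anything specific to the survey: a single GCN-style message-passing layer, H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W), written in plain PyTorch over a toy four-node graph.

```python
# Sketch: one GCN-style message-passing layer on a dense toy graph.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """H' = ReLU(A_hat @ H @ W) with self-loops and symmetric normalization."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, feats):
        # adj: (N, N) dense adjacency, feats: (N, in_dim) node features
        a_hat = adj + torch.eye(adj.size(0))              # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt        # D^-1/2 (A+I) D^-1/2
        return torch.relu(norm_adj @ self.linear(feats))  # aggregate + transform

adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 0.],
                    [0., 1., 0., 0.]])
print(GCNLayer(8, 16)(adj, torch.randn(4, 8)).shape)  # torch.Size([4, 16])
```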

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding aims to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …