Look before you leap: Learning landmark features for one-stage visual grounding

X Li, C Wen, Y Hu, Z Yuan… - IEEE Geoscience and …, 2024 - ieeexplore.ieee.org

The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4)
have sparked a wave of interest and research in the field of large language models …

Cited by 46 · Related articles · All 5 versions

[PDF] arxiv.org

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …

Cited by 139 · Related articles · All 5 versions

[PDF] thecvf.com

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com

Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

Cited by 114 · Related articles · All 7 versions

[PDF] thecvf.com

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

Cited by 97 · Related articles · All 10 versions

MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer

C Chen, D Han, CC Chang - Pattern Recognition, 2024 - Elsevier

Transformer and its variants have become the preferred option for multimodal
vision-language paradigms. However, they struggle with tasks that demand high-dependency …

Cited by 91 · Related articles · All 3 versions

[PDF] thecvf.com

Joint visual grounding and tracking with natural language specification

L Zhou, Z Zhou, K Mao, Z He - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Tracking by natural language specification aims to locate the referred target in a sequence
based on the natural language description. Existing algorithms solve this issue in two steps …

Cited by 51 · Related articles · All 5 versions

[PDF] arxiv.org

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org

In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Cited by 76 · Related articles · All 3 versions

[PDF] thecvf.com

Shifting more attention to visual backbone: Query-modulated refinement networks for end-to-end visual grounding

J Ye, J Tian, M Yan, X Yang, X Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com

Visual grounding focuses on establishing fine-grained alignment between vision and natural
language, which has essential applications in multimodal reasoning systems. Existing …

Cited by 62 · Related articles · All 5 versions

[PDF] arxiv.org

EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain

W Zhang, M Cai, T Zhang, Y Zhuang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Multimodal large language models (MLLMs) have demonstrated remarkable success in
vision and visual-language tasks within the natural image domain. Owing to the significant …

Cited by 42 · Related articles · All 3 versions

[PDF] thecvf.com

Iterative robust visual grounding with masked reference based centerpoint supervision

M Li, C Wang, W Feng, S Lyu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Visual Grounding (VG) aims at localizing target objects from an image based on given
expressions and has made significant progress with the development of detection and vision …

Cited by 5 · Related articles · All 6 versions
