Vision-language models in remote sensing: Current progress and future trends

X Li, C Wen, Y Hu, Z Yuan… - IEEE Geoscience and …, 2024 - ieeexplore.ieee.org
The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4
(GPT-4) have sparked a wave of interest and research in the field of large language models …

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC), and segmentation …

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding is the task of locating the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer

C Chen, D Han, CC Chang - Pattern Recognition, 2024 - Elsevier
Transformer and its variants have become the preferred option for multimodal vision-
language paradigms. However, they struggle with tasks that demand high-dependency …

Joint visual grounding and tracking with natural language specification

L Zhou, Z Zhou, K Mao, Z He - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Tracking by natural language specification aims to locate the referred target in a sequence
based on the natural language description. Existing algorithms solve this issue in two steps …

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org
In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Shifting more attention to visual backbone: Query-modulated refinement networks for end-to-end visual grounding

J Ye, J Tian, M Yan, X Yang, X Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding focuses on establishing fine-grained alignment between vision and natural
language, which has essential applications in multimodal reasoning systems. Existing …

EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain

W Zhang, M Cai, T Zhang, Y Zhuang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Multimodal large language models (MLLMs) have demonstrated remarkable success in
vision and vision-language tasks within the natural image domain. Owing to the significant …

Iterative robust visual grounding with masked reference based centerpoint supervision

M Li, C Wang, W Feng, S Lyu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual Grounding (VG) aims at localizing target objects from an image based on given
expressions and has made significant progress with the development of detection and vision …