Object captioning and retrieval with natural language

DZ Chen, AX Chang, M Nießner - European conference on computer …, 2020 - Springer

We introduce the task of 3D object localization in RGB-D scans using natural language
descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free …

Cited by 345 Related articles All 5 versions

Rsvg: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org

In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Cited by 93 Related articles All 3 versions

Coarse-to-fine reasoning for visual question answering

BX Nguyen, T Do, H Tran, E Tjiputra… - Proceedings of the …, 2022 - openaccess.thecvf.com

Bridging the semantic gap between image and question is an important step to improve the
accuracy of the Visual Question Answering (VQA) task. However, most of the existing VQA …

Cited by 60 Related articles All 10 versions

Graph-based person signature for person re-identifications

BX Nguyen, BD Nguyen, T Do… - Proceedings of the …, 2021 - openaccess.thecvf.com

The task of person re-identification (ReID) is to match images of the same person over
multiple non-overlapping camera views. Due to the variations in visual factors, previous …

Cited by 67 Related articles All 10 versions

Real-time 6dof pose relocalization for event cameras with stacked spatial lstm networks

A Nguyen, TT Do, DG Caldwell… - Proceedings of the …, 2019 - openaccess.thecvf.com

We present a new method to relocalize the 6DOF pose of an event camera solely based on
the event stream. Our method first creates the event image from a list of events that occurs in …

Cited by 98 Related articles All 12 versions

A joint network for grasp detection conditioned on natural language commands

Y Chen, R Xu, Y Lin, PA Vela - 2021 IEEE International …, 2021 - ieeexplore.ieee.org

We consider the task of grasping a target object based on a natural language command
query. Previous work primarily focused on localizing the object given the query, which …

Cited by 50 Related articles All 5 versions

Light-weight deformable registration using adversarial learning with distilling knowledge

MQ Tran, T Do, H Tran, E Tjiputra… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Deformable registration is a crucial step in many medical procedures such as image-guided
surgery and radiation therapy. Most recent learning-based methods focus on improving the …

Cited by 28 Related articles All 8 versions

Autonomous navigation in complex environments with deep multimodal fusion network

A Nguyen, N Nguyen, K Tran… - 2020 IEEE/RSJ …, 2020 - ieeexplore.ieee.org

Autonomous navigation in complex environments is a crucial task in time-sensitive
scenarios such as disaster response or search and rescue. However, complex environments …

Cited by 45 Related articles All 7 versions

Exploration of Cross‐Modal Text Generation Methods in Smart Justice

Y Zhang - Scientific Programming, 2021 - Wiley Online Library

With the development of modern science and technology, information technology has
brought great changes to many fields. Smart justice has become one of the increasing areas …

Cited by 4 Related articles All 7 versions

Language conditioned multi-scale visual attention networks for visual grounding

H Yao, L Wang, C Cai, W Wang, Z Zhang… - Image and Vision …, 2024 - Elsevier

Visual grounding (VG) is a task that requires to locate a specific region in an image
according to a natural language expression. Existing efforts on the VG task are divided into …

Cited by 1 Related articles All 2 versions
