GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and
executing linguistic commands within a visual context presents a significant challenge. This …

Reading-strategy inspired visual representation learning for text-to-video retrieval

J Dong, Y Wang, X Chen, X Qu, X Li… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
This paper aims for the task of text-to-video retrieval, where given a query in the form of a
natural-language sentence, it is asked to retrieve videos which are semantically relevant to …

Dual alignment unsupervised domain adaptation for video-text retrieval

X Hao, W Zhang, D Wu, F Zhu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-text retrieval is an emerging stream in both computer vision and natural language
processing communities, which aims to find relevant videos given text queries. In this paper …

Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval

X Qin, L Li, F Hao, M Ge, G Pang - Information Processing & Management, 2024 - Elsevier
Image–text retrieval plays a considerable role in associating vision and language. Existing
mainstream approaches focus on fine-grained alignment while ignoring the influence of …

Uncertainty-aware alignment network for cross-domain video-text retrieval

X Hao, W Zhang - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Video-text retrieval is an important but challenging research task in the multimedia
community. In this paper, we address the challenge task of Unsupervised Domain …

Domain adaptive twin support vector machine learning using privileged information

Y Li, H Sun, W Yan - Neurocomputing, 2022 - Elsevier
In the fields of computer vision and machine learning, domain adaptation has been
extensively studied and the main challenge in the case is how to transform the existing …

Multi-level feature disentanglement network for cross-dataset face forgery detection

Z Fu, X Chen, D Liu, X Qu, J Dong, X Zhang… - Image and Vision …, 2023 - Elsevier
Synthesizing videos with forged faces is a fundamental yet important safety-critical task that
has caused severe security issues in recent years. Although many existing face forgery …

Multilevel Semantic Interaction Alignment for Video–Text Cross-Modal Retrieval

L Chen, Z Deng, L Liu, S Yin - IEEE Transactions on Circuits …, 2024 - ieeexplore.ieee.org
Video–text cross-modal retrieval (VTR) is more natural and challenging than image–text
retrieval, which has attracted increasing interest from researchers in recent years. To align …

FeatInter: exploring fine-grained object features for video-text retrieval

B Liu, Q Zheng, Y Wang, M Zhang, J Dong, X Wang - Neurocomputing, 2022 - Elsevier
In this paper, we target the challenging task of video-text retrieval. The common way for this
task is to learn a text-video joint embedding space by cross-modal representation learning …

Unpaired referring expression grounding via bidirectional cross-modal matching

H Shi, M Hayat, J Cai - Neurocomputing, 2023 - Elsevier
Referring expression grounding is an important and challenging task in computer vision. To
avoid the laborious annotation in conventional referring grounding, unpaired referring …