Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval

M Jin, W Hu, L Zhu, X Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
To meet users' demands for video retrieval, text-video cross-modal retrieval technology
continues to evolve. Methods based on pre-trained models and transfer learning are widely …

SDDA: A progressive self-distillation with decoupled alignment for multimodal image–text classification

X Chen, Q Shuai, F Hu, Y Cheng - Neurocomputing, 2025 - Elsevier
Multimodal image–text classification endeavors to deduce the correct category based on the
information encapsulated in image–text pairs. Despite the commendable performance …

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

S Jiao, H Dong, Y Yin, Z Jie, Y Qian, Y Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent works in 3D multimodal learning have made remarkable progress. However,
typically 3D multimodal models are only capable of handling point clouds. Compared to the …

Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering

X Wu, J Wu, L Zhu, L Senhadji… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Video question answering (VideoQA) is the challenging task of accurately responding to
natural language questions based on a given video. Most previous methods focus on …

Enhancing Text-Video Retrieval Performance with Low-Salient but Discriminative Objects

Y Zheng, B Huang, Z Chen, D Yu - IEEE Transactions on Image …, 2025 - ieeexplore.ieee.org
Text-video retrieval aims to establish a matching relationship between a video and its
corresponding text. However, previous works have primarily focused on salient video …

Chatting with interactive memory for text-based person retrieval

C He, S Li, Z Wang, H Chen, F Shen, X Xu - Multimedia Systems, 2025 - Springer
Text-based person retrieval aims to match a specific pedestrian image with textual
descriptions. Traditional approaches have largely focused on utilizing a “single-shot” query …

PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding

Z Wei, X Jiang, Z Wang, F Shen, X Xu - Proceedings of the 2024 …, 2024 - dl.acm.org
Compositional temporal grounding (CTG) aims to localize the most relevant segment from
an untrimmed video based on a given natural language sentence, and the test samples for …

Disentangled Noisy Correspondence Learning

Z Dang, M Luo, J Wang, C Jia, H Han, H Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
However, existing methods implicitly assume well-matched training data, which is …

Text-Video Retrieval with Global-Local Semantic Consistent Learning

H Zhang, P Zeng, L Gao, J Song, Y Duan, X Lyu… - arXiv preprint arXiv …, 2024 - arxiv.org
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain
represents the current state-of-the-art for text-video retrieval. The primary approaches …