Semantics disentangling for cross-modal retrieval

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org

With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Cited by 24

Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval

M Jin, W Hu, L Zhu, X Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

To meet users' demands for video retrieval, text-video cross-modal retrieval technology
continues to evolve. Methods based on pre-trained models and transfer learning are widely …

Cited by 2

SDDA: A progressive self-distillation with decoupled alignment for multimodal image–text classification

X Chen, Q Shuai, F Hu, Y Cheng - Neurocomputing, 2025 - Elsevier

Multimodal image–text classification endeavors to deduce the correct category based on the
information encapsulated in image–text pairs. Despite the commendable performance …

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

S Jiao, H Dong, Y Yin, Z Jie, Y Qian, Y Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent works in 3D multimodal learning have made remarkable progress. However,
typically 3D multimodal models are only capable of handling point clouds. Compared to the …

Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering

X Wu, J Wu, L Zhu, L Senhadji… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Video question answering (VideoQA) is the challenging task of accurately responding to
natural language questions based on a given video. Most previous methods focus on …

Enhancing Text-Video Retrieval Performance with Low-Salient but Discriminative Objects

Y Zheng, B Huang, Z Chen, D Yu - IEEE Transactions on Image …, 2025 - ieeexplore.ieee.org

Text-video retrieval aims to establish a matching relationship between a video and its
corresponding text. However, previous works have primarily focused on salient video …

Chatting with interactive memory for text-based person retrieval

C He, S Li, Z Wang, H Chen, F Shen, X Xu - Multimedia Systems, 2025 - Springer

Text-based person retrieval aims to match a specific pedestrian image with textual
descriptions. Traditional approaches have largely focused on utilizing a “single-shot” query …

PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding

Z Wei, X Jiang, Z Wang, F Shen, X Xu - Proceedings of the 2024 …, 2024 - dl.acm.org

Compositional temporal grounding (CTG) aims to localize the most relevant segment from
an untrimmed video based on a given natural language sentence, and the test samples for …

Disentangled Noisy Correspondence Learning

Z Dang, M Luo, J Wang, C Jia, H Han, H Wan… - arXiv preprint arXiv …, 2024 - arxiv.org

Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
However, existing methods implicitly assume well-matched training data, which is …

Text-Video Retrieval with Global-Local Semantic Consistent Learning

H Zhang, P Zeng, L Gao, J Song, Y Duan, X Lyu… - arXiv preprint arXiv …, 2024 - arxiv.org

Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain
represents the current state-of-the-art for text-video retrieval. The primary approaches …

Cited by 1
