Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

F Shu, B Chen, Y Liao, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
We present a simple yet effective end-to-end Video-language Pre-training (VidLP)
framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval …

CSDNet: Contrastive Similarity Distillation Network for Multi-lingual Image-Text Retrieval

S Lu, L Guo, X He, X Zhu, J Liu, S Liu - International Conference on Image …, 2023 - Springer
Cross-modal image-text retrieval is a crucial task in the field of vision and language, aimed
at retrieving the relevant samples from one modality as per the given user expressed in …

A Unified Framework for Optimizing Video Corpus Retrieval and Temporal Answer Grounding: Fine-Grained Modality Alignment and Local-Global Optimization

S Cheng, Z Zhou, J Liu, J Ye, H Luo, Y Gu - CCF International Conference …, 2023 - Springer
Present advancements in digital content have resulted in an enhanced interest in video
understanding. The Temporal Answer Grounding in Video Corpus (TAGVC) aims to pinpoint …