Mamo: masked multimodal modeling for fine-grained vision-language representation learning

4 results (0.02 seconds)

Context-aware alignment and mutual masking for 3d-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

Cited by: 28 (all 3 versions)

MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

F Shu, B Chen, Y Liao, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

We present a simple yet effective end-to-end Video-language Pre-training (VidLP)
framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval …

CSDNet: Contrastive Similarity Distillation Network for Multi-lingual Image-Text Retrieval

S Lu, L Guo, X He, X Zhu, J Liu, S Liu - International Conference on Image …, 2023 - Springer

Cross-modal image-text retrieval is a crucial task in the field of vision and language, aimed
at retrieving the relevant samples from one modality as per the given user expressed in …

A Unified Framework for Optimizing Video Corpus Retrieval and Temporal Answer Grounding: Fine-Grained Modality Alignment and Local-Global Optimization

S Cheng, Z Zhou, J Liu, J Ye, H Luo, Y Gu - CCF International Conference …, 2023 - Springer

Present advancements in digital content have resulted in an enhanced interest in video
understanding. The Temporal Answer Grounding in Video Corpus (TAGVC) aims to pinpoint …
