Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has shown dramatic performance in image-text
retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers …

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific and lack pre …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and
executing linguistic commands within a visual context presents a significant challenge. This …

LexLIP: Lexicon-bottlenecked language-image pre-training for large-scale image-text sparse retrieval

Z Luo, P Zhao, C Xu, X Geng, T Shen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from
the other modality. The conventional dense retrieval paradigm relies on encoding images …

Vision-and-language pretrained models: A survey

S Long, F Cao, SC Han, H Yang - arXiv preprint arXiv:2204.07356, 2022 - arxiv.org
Pretrained models have produced great success in both Computer Vision (CV) and Natural
Language Processing (NLP). This progress leads to learning joint representations of vision …

RaSa: Relation and sensitivity aware representation learning for text-based person search

Y Bai, M Cao, D Gao, Z Cao, C Chen, Z Fan… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-based person search aims to retrieve the specified person images given a textual
description. The key to tackling such a challenging task is to learn powerful multi-modal …