Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has shown dramatic performance in image-text
retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers …

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific and lack pre …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and
executing linguistic commands within a visual context presents a significant challenge. This …

LexLIP: Lexicon-bottlenecked language-image pre-training for large-scale image-text sparse retrieval

Z Luo, P Zhao, C Xu, X Geng, T Shen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from
the other modality. The conventional dense retrieval paradigm relies on encoding images …

Vision-and-language pretrained models: A survey

S Long, F Cao, SC Han, H Yang - arXiv preprint arXiv:2204.07356, 2022 - arxiv.org
Pretrained models have produced great success in both Computer Vision (CV) and Natural
Language Processing (NLP). This progress leads to learning joint representations of vision …

RaSa: Relation and sensitivity aware representation learning for text-based person search

Y Bai, M Cao, D Gao, Z Cao, C Chen, Z Fan… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-based person search aims to retrieve the specified person images given a textual
description. The key to tackling such a challenging task is to learn powerful multi-modal …