Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges

K Bayoudh - Information Fusion, 2024 - Elsevier
In recent years, deep learning algorithms have rapidly revolutionized artificial intelligence,
particularly machine learning, enabling researchers and practitioners to extend previously …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …
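The InfoNCE objective named here is the standard symmetric image-text contrastive loss used in CLIP-style pre-training. As a point of reference only (not code from the cited paper), a minimal PyTorch sketch, assuming paired, already-computed image and text embeddings and an illustrative temperature value:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)        # unit-normalize both modalities
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

The cited triple contrastive formulation builds on this basic cross-modal alignment term with additional contrastive objectives.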

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

Masked vision and language modeling for multi-modal representation learning

G Kwon, Z Cai, A Ravichandran, E Bas… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we study how to use masked signal modeling in vision and language (V+L)
representation learning. Instead of developing masked language modeling (MLM) and …
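For context, masked language modeling corrupts a fraction of input tokens and trains the model to reconstruct them; the cited work studies how to apply such masked-signal objectives jointly to the vision and language streams rather than to each in isolation. A minimal sketch of the standard BERT-style masking step (a hypothetical helper, not the paper's joint V+L procedure):

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """BERT-style corruption: replace a random subset of tokens with the mask token."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    # Unmasked positions get label -100, which cross-entropy ignores by default
    labels = token_ids.masked_fill(~mask, -100)
    inputs = token_ids.masked_fill(mask, mask_token_id)
    return inputs, labels
```

The model is then trained with a cross-entropy loss over the vocabulary at the masked positions only.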

Achieving cross modal generalization with multimodal unified representation

Y Xia, H Huang, J Zhu, Z Zhao - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper introduces a novel task called Cross Modal Generalization (CMG), which
addresses the challenge of learning a unified discrete representation from paired …

Multimodal optimal transport-based co-attention transformer with global structure consistency for survival prediction

Y Xu, H Chen - Proceedings of the IEEE/CVF International …, 2023 - openaccess.thecvf.com
Survival prediction is a complicated ordinal regression task that aims to predict the ranking
risk of death, which generally benefits from the integration of histology and genomic data …

Learning to adapt CLIP for few-shot monocular depth estimation

X Hu, C Zhang, Y Zhang, B Hai… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained Visual-Language Models (VLMs), such as CLIP, have shown enhanced
performance across a range of tasks that involve the integration of visual and linguistic …

Understanding and constructing latent modality structures in multi-modal representation learning

Q Jiang, C Chen, H Zhao, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive loss has been increasingly used in learning representations from multiple
modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly …

Efficient vision-language pretraining with visual concepts and hierarchical alignment

M Shukor, G Couairon, M Cord - arXiv preprint arXiv:2208.13628, 2022 - arxiv.org
Vision and Language Pretraining has become the prevalent approach for tackling
multimodal downstream tasks. The current trend is to move towards ever larger models and …