Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

TransFuser: Imitation with transformer-based sensor fusion for autonomous driving

K Chitta, A Prakash, B Jaeger, Z Yu… - … on Pattern Analysis …, 2022 - ieeexplore.ieee.org
How should we integrate representations from complementary sensors for autonomous
driving? Geometry-based fusion has shown promise for perception (e.g., object detection …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …

Multi-modal fusion transformer for end-to-end autonomous driving

A Prakash, K Chitta, A Geiger - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
How should representations from complementary sensors be integrated for autonomous
driving? Geometry-based sensor fusion has shown great promise for perception tasks such …

Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy… - International …, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

X Dong, J Bao, Y Zheng, T Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …

Pre-trained models: Past, present and future

X Han, Z Zhang, N Ding, Y Gu, X Liu, Y Huo, J Qiu… - AI Open, 2021 - Elsevier
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …