Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

VIOLET: End-to-end video-language transformers with masked visual-token modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …

Med-UniC: Unifying cross-lingual medical vision-language pre-training by diminishing bias

Z Wan, C Liu, M Zhang, J Fu, B Wang… - Advances in …, 2024 - proceedings.neurips.cc
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-
training (VLP). A potential solution lies in the combination of datasets from various language …

Deep vision multimodal learning: Methodology, benchmark, and trend

W Chai, G Wang - Applied Sciences, 2022 - mdpi.com
Deep vision multimodal learning aims at combining deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …

Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

J Gu, X Meng, G Lu, L Hou, N Minzhe… - Advances in …, 2022 - proceedings.neurips.cc
Abstract Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

Class-agnostic object detection with multi-modal transformer

M Maaz, H Rasheed, S Khan, FS Khan… - European conference on …, 2022 - Springer
What constitutes an object? This has been a long-standing question in computer vision.
Towards this goal, numerous learning-free and learning-based approaches have been …