Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have
been proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Mainstream Video-Language Pre-training models consist of three parts: a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …

UNITER: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

Defense against adversarial attacks using feature scattering-based adversarial training

H Zhang, J Wang - Advances in neural information …, 2019 - proceedings.neurips.cc
We introduce a feature scattering-based adversarial training approach for improving model
robustness against adversarial attacks. Conventional adversarial training approaches …

Graph optimal transport for cross-domain alignment

L Chen, Z Gan, Y Cheng, L Li… - … on Machine Learning, 2020 - proceedings.mlr.press
Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a
sentence) is fundamental to both computer vision and natural language processing. Existing …

MixGen: A new multi-modal data augmentation

X Hao, Y Zhu, S Appalaraju, A Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-
language pre-training, previous works augment data only for images or only for text …

A unified computational framework for single-cell data integration with optimal transport

K Cao, Q Gong, Y Hong, L Wan - Nature Communications, 2022 - nature.com
Single-cell data integration can provide a comprehensive molecular view of cells. However,
how to integrate heterogeneous single-cell multi-omics as well as spatially resolved …

Gromov-Wasserstein learning for graph matching and node embedding

H Xu, D Luo, H Zha, L Carin - International conference on …, 2019 - proceedings.mlr.press
A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs
and learn embedding vectors for the associated graph nodes. Using Gromov-Wasserstein …

Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training

H Xue, Y Huang, B Liu, H Peng, J Fu… - Advances in Neural …, 2021 - proceedings.neurips.cc
Vision-Language Pre-training (VLP) aims to learn multi-modal representations from
image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion …