With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
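As a concrete illustration of reusing such pre-trained models, the sketch below loads BERT and ViT checkpoints through the Hugging Face `transformers` library (recent versions) and extracts features from placeholder inputs. The checkpoint names, library choice, and dummy inputs are assumptions made here for illustration, not details taken from the snippet above.

```python
# Minimal sketch: reuse pre-trained BERT (text) and ViT (vision) as feature
# extractors. Checkpoints download on first use; names here are common public
# ones chosen as examples, not ones named in the snippet above.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel

# Text branch: BERT produces contextual token embeddings.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
text_inputs = tokenizer("a photo of a dog", return_tensors="pt")
with torch.no_grad():
    text_feats = bert(**text_inputs).last_hidden_state    # (1, seq_len, 768)

# Vision branch: ViT produces patch embeddings for an image.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
image = np.zeros((224, 224, 3), dtype=np.uint8)            # placeholder image
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_feats = vit(**image_inputs).last_hidden_state    # (1, 197, 768)
```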
Mainstream Video-Language Pre-training models consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
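A minimal PyTorch sketch of the three-part layout described above (video encoder, text encoder, video-text fusion Transformer). Every dimension, layer count, and input format below is an illustrative assumption, not the configuration of any particular pre-training model.

```python
# Sketch of a dual-encoder-plus-fusion video-language model.
import torch
import torch.nn as nn

class VideoTextModel(nn.Module):
    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        # Video encoder: project precomputed 768-d frame-patch features, then a Transformer.
        self.video_proj = nn.Linear(768, dim)
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        # Text encoder: token embeddings followed by a Transformer.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        # Video-text fusion Transformer over the concatenated sequences.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, video_patches, text_ids):
        v = self.video_encoder(self.video_proj(video_patches))  # (B, frames*patches, dim)
        t = self.text_encoder(self.text_embed(text_ids))        # (B, tokens, dim)
        return self.fusion(torch.cat([v, t], dim=1))            # joint video-text features

model = VideoTextModel()
out = model(torch.randn(2, 8 * 196, 768), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 1584, 256]): 8*196 video tokens + 16 text tokens
```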
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual …
H Zhang, J Wang - Advances in neural information …, 2019 - proceedings.neurips.cc
We introduce a feature scattering-based adversarial training approach for improving model robustness against adversarial attacks. Conventional adversarial training approaches …
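Since the snippet contrasts feature scattering with conventional adversarial training, here is a minimal PyTorch sketch of the conventional inner loop (PGD-style perturbation of each input toward higher label loss). The model interface, epsilon, and step sizes are assumptions; this is not the feature-scattering procedure itself, which generates perturbations from feature-space structure across samples rather than from per-sample label loss.

```python
# Conventional adversarial training inner step: craft an L-inf-bounded
# perturbation that increases the classification loss, then train on it.
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8 / 255, alpha=2 / 255, steps=7):
    """Return adversarial examples within an L-inf ball of radius eps around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                # ascend the loss
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv

# Training step on the perturbed batch (model, x, y assumed to exist):
# loss = F.cross_entropy(model(pgd_perturb(model, x, y)), y); loss.backward()
```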
Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing …
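One common way to make such alignment concrete is as an optimal-transport problem between the two entity sets; the sketch below computes a soft region-to-word alignment with entropic OT via the POT library. The random embeddings, cosine cost, and regularization value are assumptions for illustration, not necessarily the method the snippet above truncates.

```python
# Soft cross-domain alignment as entropic optimal transport between
# image-region embeddings and word embeddings.
import numpy as np
import ot  # pip install pot

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 64))   # 5 image-region embeddings (assumed)
words = rng.normal(size=(7, 64))     # 7 word embeddings (assumed)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

M = 1.0 - normalize(regions) @ normalize(words).T   # (5, 7) cosine-distance costs
a = np.full(5, 1 / 5)                               # uniform mass on regions
b = np.full(7, 1 / 7)                               # uniform mass on words

T = ot.sinkhorn(a, b, M, reg=0.1)                   # soft alignment plan
print(T.shape, T.sum())                             # (5, 7), total mass ~1
print(T.argmax(axis=1))                             # most-aligned word per region
```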
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text …
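For contrast with such modality-specific augmentation, here is a minimal sketch of one possible joint image-text augmentation: interpolate two images and concatenate their captions so both modalities change together. The pairing rule and mixing coefficient are illustrative assumptions, not a claim about the exact augmentation the truncated snippet describes.

```python
# Joint augmentation of a paired (image, caption) example.
import torch

def joint_mix(image_a, caption_a, image_b, caption_b, lam=0.5):
    """Blend two (image, caption) pairs into one augmented training example."""
    mixed_image = lam * image_a + (1.0 - lam) * image_b   # pixel-level interpolation
    mixed_caption = caption_a + " " + caption_b           # keep both descriptions
    return mixed_image, mixed_caption

img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
new_img, new_cap = joint_mix(img_a, "a dog on grass", img_b, "a red bicycle")
print(new_img.shape, "|", new_cap)  # torch.Size([3, 224, 224]) | a dog on grass a red bicycle
```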
Single-cell data integration can provide a comprehensive molecular view of cells. However, how to integrate heterogeneous single-cell multi-omics as well as spatially resolved …
H Xu, D Luo, H Zha, L Carin - International conference on …, 2019 - proceedings.mlr.press
A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs and learn embedding vectors for the associated graph nodes. Using Gromov-Wasserstein …
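A minimal sketch of Gromov-Wasserstein graph matching using the POT library: the coupling between two graphs' intra-graph structure matrices yields a soft node-to-node correspondence. The random graphs, the use of adjacency matrices as structure, and the networkx dependency are assumptions made here; the framework above additionally learns node embedding vectors, which this sketch omits.

```python
# Soft graph matching via the Gromov-Wasserstein coupling between two
# graphs' adjacency (structure) matrices.
import networkx as nx
import numpy as np
import ot  # pip install pot

g1 = nx.erdos_renyi_graph(10, 0.4, seed=1)
g2 = nx.erdos_renyi_graph(12, 0.35, seed=2)

C1 = nx.to_numpy_array(g1)                  # intra-graph structure of g1
C2 = nx.to_numpy_array(g2)                  # intra-graph structure of g2
p = np.full(C1.shape[0], 1 / C1.shape[0])   # uniform node weights
q = np.full(C2.shape[0], 1 / C2.shape[0])

# Coupling that best preserves pairwise structure across the two graphs.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")
print(T.shape)              # (10, 12) coupling matrix
print(T.argmax(axis=1))     # most likely counterpart in g2 for each node of g1
```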
Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion …
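As one widely used pre-training signal for learning joint representations from image-text pairs, the sketch below implements a symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. The encoders producing those embeddings, the temperature, and the batch size are assumptions, not the objective of the specific paper above.

```python
# Symmetric image-text contrastive loss: matched pairs sit on the diagonal
# of the similarity matrix and are treated as the positive class.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Pull matched image-text pairs together, push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0))                   # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```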