P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human …
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and …
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language …
W Chai, G Wang - Applied Sciences, 2022 - mdpi.com
Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast …
J Gu, X Meng, G Lu, L Hou, N Minzhe… - Advances in …, 2022 - proceedings.neurips.cc
Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …
Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained …
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their …
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been …