Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning …
This monograph surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches …
SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies of the Metaverse based on Second Life, the current Metaverse is grounded in Generation Z's social value that online and offline selves are not different …
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies …
C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human …
Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependencies using the self-attention mechanism …
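The two snippets above credit self-attention with capturing short- and long-range visual dependencies. As a rough illustration of how a single attention layer mixes information across all positions, here is a minimal sketch; the function name, shapes, and single-head setup are assumptions for illustration, not the architecture of any paper listed here.

```python
# Minimal scaled dot-product self-attention sketch (illustrative only).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) patch/token embeddings; w_*: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project into query/key/value spaces
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled pairwise similarities
    weights = F.softmax(scores, dim=-1)                    # every position attends to every other
    return weights @ v                                     # weighted mix spans short- and long-range context

# Toy usage: 16 patch embeddings of dimension 64, one head of dimension 32.
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                     # shape (16, 32)
```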
S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior …
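To make the contrastive-pretraining idea in the snippet above concrete, the following is a hedged sketch of a CLIP-style symmetric contrastive (InfoNCE) objective over a batch of paired image/text embeddings; the function name, temperature value, and shapes are assumptions for illustration, not the exact formulation used by CLIP or the cited work.

```python
# Sketch of a symmetric image-text contrastive loss (CLIP-style); illustrative assumptions only.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img = F.normalize(img_emb, dim=-1)               # unit-norm so dot products are cosine similarities
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (batch, batch) similarity matrix
    targets = torch.arange(img.shape[0])             # the matching pair sits on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random vectors standing in for encoder outputs.
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_loss(img_emb, txt_emb))
```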
W Xia, Y Yang, JH Xue, B Wu - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components …