X Xu, H Lu, J Song, Y Yang… - IEEE transactions on …, 2019 - ieeexplore.ieee.org
Given a query instance from one modality (e.g., image), cross-modal retrieval aims to find semantically similar instances from another modality (e.g., text). To perform cross-modal …
W Xia, T Wang, Q Gao, M Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Multi-modal clustering (MMC) aims to exploit complementary information from diverse modalities to improve clustering performance. This article studies challenging problems in …
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory …
X Xu, K Lin, Y Yang, A Hanjalic… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Recently, the generative adversarial network (GAN) has shown a strong ability to model data distributions via adversarial learning. Cross-modal GAN, which attempts to utilize the …
X Huang, Y Peng, M Yuan - IEEE transactions on cybernetics, 2018 - ieeexplore.ieee.org
Cross-modal retrieval has drawn wide interest for retrieval across different modalities (such as text, image, video, audio, and 3-D model). However, existing methods based on a deep …
X Hao, W Zhang, D Wu, F Zhu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-text retrieval is an emerging stream in both computer vision and natural language processing communities, which aims to find relevant videos given text queries. In this paper …
X Xu, K Lin, L Gao, H Lu, HT Shen… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Due to the inconsistent distributions and representations of different modalities (e.g., images and texts), it is very challenging to correlate such heterogeneous data. A standard solution is …
Y Liu, Z Lu, J Li, T Yang, C Yao - IEEE Transactions on Image …, 2019 - ieeexplore.ieee.org
Existing deep learning methods for action recognition in videos require a large number of labeled videos for training, which is labor-intensive and time-consuming. For the same …
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information …