Deep multimodal representation learning: A survey

W Guo, J Wang, S Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Multimodal representation learning, which aims to narrow the heterogeneity gap among
different modalities, plays an indispensable role in the utilization of ubiquitous multimodal …

Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval

X Xu, H Lu, J Song, Y Yang… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Given a query instance from one modality (e.g., image), cross-modal retrieval aims to find
semantically similar instances from another modality (e.g., text). To perform cross-modal …

Graph embedding contrastive multi-modal representation learning for clustering

W Xia, T Wang, Q Gao, M Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Multi-modal clustering (MMC) aims to explore complementary information from diverse
modalities for clustering performance facilitating. This article studies challenging problems in …

Multi-modality associative bridging through memory: Speech sound recollected from face video

M Kim, J Hong, SJ Park, YM Ro - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we introduce a novel audio-visual multi-modal bridging framework that can
utilize both audio and visual information, even with uni-modal inputs. We exploit a memory …

Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited

X Xu, K Lin, Y Yang, A Hanjalic… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Recently, generative adversarial network (GAN) has shown its strong ability on modeling
data distribution via adversarial learning. Cross-modal GAN, which attempts to utilize the …

MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval

X Huang, Y Peng, M Yuan - IEEE Transactions on Cybernetics, 2018 - ieeexplore.ieee.org
Cross-modal retrieval has drawn wide interest for retrieval across different modalities (such
as text, image, video, audio, and 3-D model). However, existing methods based on a deep …

Dual alignment unsupervised domain adaptation for video-text retrieval

X Hao, W Zhang, D Wu, F Zhu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-text retrieval is an emerging stream in both computer vision and natural language
processing communities, which aims to find relevant videos given text queries. In this paper …

Learning cross-modal common representations by private–shared subspaces separation

X Xu, K Lin, L Gao, H Lu, HT Shen… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Due to the inconsistent distributions and representations of different modalities (e.g., images
and texts), it is very challenging to correlate such heterogeneous data. A standard solution is …

Deep image-to-video adaptation and fusion networks for action recognition

Y Liu, Z Lu, J Li, T Yang, C Yao - IEEE Transactions on Image …, 2019 - ieeexplore.ieee.org
Existing deep learning methods for action recognition in videos require a large number of
labeled videos for training, which is labor-intensive and time-consuming. For the same …

AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model

JH Yeo, M Kim, J Choi, DH Kim… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip
movements. VSR is regarded as a challenging task because of the insufficient information …