Self-supervised correlation learning for cross-modal retrieval

Y Liu, J Wu, L Qu, T Gan, J Yin… - IEEE Transactions on Multimedia, 2022 - ieeexplore.ieee.org
Cross-modal retrieval aims to retrieve relevant data from another modality when given a query of one modality. Although most existing methods rely on the label information of multimedia data and have achieved promising results, this performance comes at a high cost: annotating data requires enormous labor, especially on large-scale multimedia datasets. Therefore, unsupervised cross-modal learning is of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we incorporate knowledge from the input as a supervisory signal by maximizing the mutual information between the input and the output of different modality-specific projectors. In addition, to learn discriminative representations, we exploit unsupervised contrastive learning to model the relationships among intra- and inter-modality instances, which pulls similar samples closer and pushes dissimilar samples apart. Moreover, to further eliminate the modality gap, we use a weight-sharing scheme and minimize a modality-invariant loss in the joint representation space. We also extend the proposed method to the semi-supervised setting. Extensive experiments conducted on three widely used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.
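
The abstract names three ingredients: modality-specific projectors trained with a mutual-information objective, intra- and inter-modality contrastive learning, and a weight-sharing scheme with a modality-invariant loss in the joint space. The sketch below is not the authors' implementation; it only illustrates the contrastive and modality-invariant pieces with a symmetric InfoNCE-style loss in PyTorch. The module and function names (ModalityProjector, cross_modal_infonce, shared_head), the network sizes, and the specific loss forms are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code) of cross-modal contrastive learning with
# modality-specific projectors, a weight-shared joint head, and a modality-invariant loss.
# All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Maps a modality-specific feature (e.g. an image or text embedding) to a common space."""
    def __init__(self, in_dim, hid_dim=1024, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(inplace=True),
            nn.Linear(hid_dim, out_dim),
        )
    def forward(self, x):
        return self.net(x)

def cross_modal_infonce(img_z, txt_z, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs are positives, all other
    pairs in the batch are negatives, pulling similar samples closer and pushing
    dissimilar samples apart."""
    img_z = F.normalize(img_z, dim=-1)
    txt_z = F.normalize(txt_z, dim=-1)
    logits = img_z @ txt_z.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_z.size(0), device=img_z.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 512-d image features, 300-d text features, a batch of 8 paired samples.
img_proj = ModalityProjector(in_dim=512)
txt_proj = ModalityProjector(in_dim=300)
shared_head = nn.Linear(256, 256)                       # weight-shared layer for the joint space

img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 300)
img_z = shared_head(img_proj(img_feat))
txt_z = shared_head(txt_proj(txt_feat))

contrastive_loss = cross_modal_infonce(img_z, txt_z)
modality_invariant_loss = F.mse_loss(img_z, txt_z)      # pull paired joint representations together
loss = contrastive_loss + modality_invariant_loss
loss.backward()
print(float(loss))
```

The symmetric two-direction loss is one common way to realize "intra- and inter-modality" contrast in a shared space; the paper's actual objectives (including the mutual-information term) may differ in form and weighting.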