Unifying two-stream encoders with transformers for cross-modal retrieval

Y Bin, H Li, Y Xu, X Xu, Y Yang, HT Shen - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Most existing cross-modal retrieval methods employ two-stream encoders with different
architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such …

Frequency information disentanglement network for video-based person re-identification

L Liu, X Yang, N Wang, X Gao - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Recently, most video-based person re-identification (Re-ID) methods adopt complex models
or multi-scaled information to explore more discriminative spatio-temporal clues, thus …

HSMH: A hierarchical sequence multi-hop reasoning model with reinforcement learning

D Wang, B Li, B Song, C Chen… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The incompleteness of knowledge graphs (KGs) negatively impacts the performance of KGs
in downstream applications (e.g., recommendation systems and information retrieval). This …

A Mutually Textual and Visual Refinement Network for Image-Text Matching

S Pang, Y Zeng, J Zhao, J Xue - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Image-text matching is vitally important in the field of multi-modal intelligence. Recently, it is
advocated in a way that decomposes images and texts into local fragments and followed by …

Multimodal Progressive Modulation Network for Micro-video Multi-label Classification

P Jing, X Zhao, F Fan, F Yang, Y Li… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally
include diverse multimodal cues. However, in pursuit of consistent representations, existing …

Spatial-Channel Attention Transformer with Pseudo Regions for Remote Sensing Image-Text Retrieval

D Wu, H Li, X Hou, C Xu, G Cheng… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Recently, remote sensing image-text retrieval (RSITR) has received significant attention due
to its flexible query form and effective management of remote sensing images. However …

ITContrast: contrastive learning with hard negative synthesis for image-text matching

F Wu, Q Wang, Z Wang, S Yu, Y Li, B Zhang… - The Visual Computer, 2024 - Springer
Image-text matching aims to bridge vision and language so as to match the instance of one
modality with the instance of another modality. Recent years have seen considerable …