Complementarity-aware space learning for video-text retrieval

J Zhu, P Zeng, L Gao, G Li, D Liao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In general, videos are powerful at recording physical patterns (e.g., spatial layout) while texts
are great at describing abstract symbols (e.g., emotion). When video and text are used in …

Transformer-based multimodal emotional perception for dynamic facial expression recognition in the wild

X Zhang, M Li, S Lin, H Xu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Dynamic expression recognition in the wild is a challenging task due to various obstacles,
including low-light conditions, non-frontal faces, and face occlusion. Purely vision-based …

An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval

W Xiong, Z Xiong, Y Cui, L Huang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
With the increasing number of remote sensing ship images, it is vitally important to retrieve
the ship objects that users are interested in from large-scale remote sensing image data. The …

Datasets, clues and state-of-the-arts for multimedia forensics: An extensive review

A Yadav, DK Vishwakarma - Expert Systems with Applications, 2024 - Elsevier
With large volumes of social media data being created daily and the parallel rise of
realistic multimedia tampering methods, detecting and localising tampering in images and …

Enhanced semantic similarity learning framework for image-text matching

K Zhang, B Hu, H Zhang, Z Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Image-text matching is a fundamental task to bridge vision and language. The critical
challenge lies in accurately learning the semantic similarity between these two …

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

X Dong, Q Guo, T Gan, Q Wang, J Wu… - … on Circuits and …, 2023 - ieeexplore.ieee.org
We present a framework for learning cross-modal video representations by directly pre-
training on raw data to facilitate various downstream video-text tasks. Our main contributions …

Video Visualization and Visual Analytics: A Task-Based and Application-Driven Investigation

W Xia, G Sun, T Li, B Chang, J Tang… - … on Circuits and …, 2024 - ieeexplore.ieee.org
Video data refers to digital information in the form of a series of frames or images
representing continuous motion captured by a video recording device. In various domains …

Learning from noisy correspondence with tri-partition for cross-modal matching

Z Feng, Z Zeng, C Guo, Z Li, L Hu - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Due to the high cost of labeling, a certain proportion of noisy correspondences is inevitably
introduced into visual-text datasets, resulting in poor model robustness for cross-modal …

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

D Yang, M Li, L Qu, K Yang, P Zhai… - … on Circuits and …, 2024 - ieeexplore.ieee.org
Understanding human intentions (e.g., emotions) from videos has received considerable
attention recently. Video streams generally constitute a blend of temporal data stemming …

UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval

H Zhang, P Zeng, L Gao, J Song… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Prompt tuning, an emerging parameter-efficient strategy, leverages the powerful knowledge
of large-scale pre-trained image-text models (e.g., CLIP) to swiftly adapt to downstream tasks …