FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

Robust cross-modal representation learning with progressive self-distillation

A Andonian, S Chen, R Hamid - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The learning objective of the vision-language approach of CLIP does not effectively account for
the noisy many-to-many correspondences found in web-harvested image captioning …

Cross-lingual cross-modal consolidation for effective multilingual video corpus moment retrieval

J Liu, T Yu, H Peng, M Sun, P Li - Findings of the Association for …, 2022 - aclanthology.org
Existing multilingual video corpus moment retrieval (mVCMR) methods are mainly based on
a two-stream structure. The visual stream utilizes the visual content in the video to estimate …

Balance act: Mitigating hubness in cross-modal retrieval with query and gallery banks

Y Wang, X Jian, B Xue - arXiv preprint arXiv:2310.11612, 2023 - arxiv.org
In this work, we present a post-processing solution to address the hubness problem in
cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently …

Cross-probe BERT for fast cross-modal search

T Yu, H Fei, P Li - Proceedings of the 45th International ACM SIGIR …, 2022 - dl.acm.org
Owing to the effectiveness of cross-modal attentions, text-vision BERT models have
achieved excellent performance in text-image retrieval. Nevertheless, cross-modal …

U-BERT for fast and scalable text-image retrieval

T Yu, H Fei, P Li - Proceedings of the 2022 ACM SIGIR International …, 2022 - dl.acm.org
Exploiting cross-modal attention on image region features and text features, cross-modal
BERT models have achieved higher accuracy than the embedding-based methods in cross …

Towards fast and accurate image-text retrieval with self-supervised fine-grained alignment

J Zhuang, J Yu, Y Ding, X Qu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and
language for accurate retrieval while keeping the network lightweight enough for efficient …

Multi-scale multi-modal dictionary BERT for effective text-image retrieval in multimedia advertising

T Yu, J Liu, Z Jin, Y Yang, H Fei, P Li - Proceedings of the 31st ACM …, 2022 - dl.acm.org
Visual content in multimedia advertising effectively attracts the customer's attention.
Search-based multimedia advertising is a cross-modal retrieval problem. Due to the modal gap …

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Z Li, L Zhang, K Zhang, Y Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Image-text retrieval is a fundamental task in bridging the semantics between vision and
language. The key challenge lies in accurately and efficiently learning the semantic …

Texture BERT for cross-modal texture image retrieval

Z Xu, T Yu, P Li - Proceedings of the 31st ACM International Conference …, 2022 - dl.acm.org
We propose Texture BERT, a model describing visual attributes of texture using natural
language. To capture the rich details in texture images, we propose a group-wise compact …