Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …

Regionclip: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …

Video-llava: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in vision-language understanding. Most existing approaches encode …

Lavis: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Learning visual representation from modality-shared contrastive language-image pre-training

H You, L Zhou, B Xiao, N Codella, Y Cheng… - … on Computer Vision, 2022 - Springer
Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn
transferable features for a range of downstream tasks by mapping multiple modalities into a …

Structure-clip: Enhance multi-modal language representations with structure knowledge

Y Huang, J Tang, Z Chen, R Zhang… - arXiv preprint arXiv …, 2023 - researchgate.net
Large-scale vision-language pre-training has shown promising advances on various
downstream tasks and achieved significant performance in multi-modal understanding and …

De-diffusion makes text a strong cross-modal interface

C Wei, C Liu, S Qiao, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We demonstrate text as a strong cross-modal interface. Rather than relying on deep
embeddings to connect image and language as the interface representation, our approach …

Equivariant similarity for vision-language foundation models

T Wang, K Lin, L Li, CC Lin, Z Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This study explores the concept of equivariance in vision-language foundation models
(VLMs), focusing specifically on the multimodal similarity function that is not only the major …

Uniter: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

Does language help generalization in vision models?

B Devillers, B Choksi, R Bielawski… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision models trained on multimodal datasets can benefit from the wide availability of large
image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot …