Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …

Regionclip: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …

Video-llava: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
Large Vision-Language Models (LVLMs) have enhanced the performance of various
downstream tasks in vision-language understanding. Most existing approaches encode …

Lavis: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Learning visual representation from modality-shared contrastive language-image pre-training

H You, L Zhou, B Xiao, N Codella, Y Cheng… - … on Computer Vision, 2022 - Springer
Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn
transferable features for a range of downstream tasks by mapping multiple modalities into a …

Structure-clip: Enhance multi-modal language representations with structure knowledge

Y Huang, J Tang, Z Chen, R Zhang… - arXiv preprint arXiv …, 2023 - researchgate.net
Large-scale vision-language pre-training has shown promising advances on various
downstream tasks and achieved significant performance in multi-modal understanding and …

De-diffusion makes text a strong cross-modal interface

C Wei, C Liu, S Qiao, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We demonstrate text as a strong cross-modal interface. Rather than relying on deep
embeddings to connect image and language as the interface representation, our approach …

Equivariant similarity for vision-language foundation models

T Wang, K Lin, L Li, CC Lin, Z Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This study explores the concept of equivariance in vision-language foundation models
(VLMs), focusing specifically on the multimodal similarity function that is not only the major …

Uniter: Universal image-text representation learning

YC Chen, L Li, L Yu, A El Kholy, F Ahmed… - European conference on …, 2020 - Springer
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks,
where multimodality inputs are simultaneously processed for joint visual and textual …

Does language help generalization in vision models?

B Devillers, B Choksi, R Bielawski… - arXiv preprint arXiv …, 2021 - arxiv.org
Vision models trained on multimodal datasets can benefit from the wide availability of large
image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot …