Robust cross-modal representation learning with progressive self-distillation

A Andonian, S Chen, R Hamid - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The learning objective of the vision-language approach of CLIP does not effectively account for
the noisy many-to-many correspondences found in web-harvested image captioning …
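For context, a minimal sketch of the standard one-positive-per-pair CLIP objective that this work relaxes. This is illustrative PyTorch, not the paper's implementation; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Treats the i-th caption as the only positive for the i-th image: the
    strict one-to-one assumption that progressive self-distillation softens
    for noisy web-harvested pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```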

Pixel-bert: Aligning image pixels with text by deep multi-modal transformers

Z Huang, Z Zeng, B Liu, D Fu, J Fu - arXiv preprint arXiv:2004.00849, 2020 - arxiv.org
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers
that jointly learn visual and language embeddings in a unified end-to-end framework. We aim …
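A minimal sketch of the joint-encoding idea: one transformer attending over concatenated visual and text tokens. This is an illustrative PyTorch module, not Pixel-BERT's actual architecture; all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """One transformer attends over visual and text tokens together,
    so alignment is learned end to end rather than via separate towers."""

    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.modality = nn.Embedding(2, dim)  # 0 = visual token, 1 = text token

    def forward(self, vis_tokens, txt_tokens):      # (B, Nv, D), (B, Nt, D)
        vis = vis_tokens + self.modality.weight[0]  # tag visual positions
        txt = txt_tokens + self.modality.weight[1]  # tag text positions
        return self.encoder(torch.cat([vis, txt], dim=1))
```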

Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Fusion of detected objects in text for visual question answering

C Alberti, J Ling, M Collins, D Reitter - arXiv preprint arXiv:1908.05054, 2019 - arxiv.org
To advance models of multimodal context, we introduce a simple yet powerful neural
architecture for data that combines vision and natural language. The "Bounding Boxes in …

Lightningdot: Pre-training visual-semantic embeddings for real-time image-text retrieval

S Sun, YC Chen, L Li, S Wang, Y Fang… - Proceedings of the 2021 …, 2021 - aclanthology.org
Multimodal pre-training has driven great advances in vision-and-language research.
These large-scale pre-trained models, although successful, fatally suffer from slow …
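The speed argument comes down to scoring with a single matrix multiply over cached embeddings rather than running cross-modal attention per query. A minimal sketch under that assumption; names and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, index_emb, k=5):
    """Rank a precomputed image index against one text query by dot product.

    No cross-modal attention runs at query time, so scoring reduces to a
    single matrix multiply over the cached index.
    """
    scores = index_emb @ query_emb          # (N,) similarity scores
    topk = torch.topk(scores, k)
    return topk.indices, topk.values

# Usage: score 10k cached, unit-normalized image embeddings against one query.
index = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(512), dim=0)
ids, scores = retrieve_top_k(query, index)
```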

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …
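A common recipe for this kind of grounding is a small learned projection from a frozen image encoder into the frozen LM's token-embedding space. The sketch below shows that general pattern, not the paper's exact method; every dimension and name is an assumption.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Project frozen image features into the word-embedding space of a
    frozen language model, so image tokens can be interleaved with text."""

    def __init__(self, vis_dim=768, lm_dim=4096, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim * n_tokens)  # only trained part
        self.n_tokens = n_tokens
        self.lm_dim = lm_dim

    def forward(self, vis_feats):                  # (B, vis_dim)
        out = self.proj(vis_feats)                 # (B, lm_dim * n_tokens)
        # Reshape into a short sequence of pseudo word embeddings that can be
        # concatenated with the LM's text embeddings.
        return out.view(-1, self.n_tokens, self.lm_dim)
```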

Efficientclip: Efficient cross-modal pre-training by ensemble confident learning and language modeling

J Wang, H Wang, J Deng, W Wu, D Zhang - arXiv preprint arXiv …, 2021 - arxiv.org
While large-scale pre-training has made great strides in bridging the gap
between vision and language, it still faces several challenges. First, the cost of pre-training …

Visually-augmented language modeling

W Wang, L Dong, H Cheng, H Song, X Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Human language is grounded in multimodal knowledge, including visual knowledge such as
colors, sizes, and shapes. However, current large-scale pre-trained language models rely …

WenLan: Bridging vision and language by large-scale multi-modal pre-training

Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang… - arXiv preprint arXiv …, 2021 - arxiv.org
Multi-modal pre-training models have been intensively explored to bridge vision and
language in recent years. However, most of them explicitly model the cross-modal …