Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - … on computer vision …, 2021 - openaccess.thecvf.com
… To arrive at CC12M, we keep the image-text filtering intact, and relax the unimodal filters …
Second, in text-based filtering, we allow text between 3 and 256 words in the alt-text. We still …

[BOOK][B] Computer vision: algorithms and applications

R Szeliski - 2022 - books.google.com
… Yvan Leclerc and Pascal Fua, colleagues from my brief interlude at SRI International, gave
me new perspectives on alternative approaches to computer vision. During my six years of …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
… In this work, we leverage a dataset of over one billion noisy image alt-text pairs to scale
visual and vision-language representation learning. We follow the procedures described in the …

LiT: Zero-shot transfer with locked-image text tuning

X Zhai, X Wang, B Mustafa, A Steiner… - … on computer vision …, 2022 - openaccess.thecvf.com
… We collect 4 billion image and alt-text pairs following the same process as ALIGN [30],
with the same image-based filtering but simpler text-based filtering. Appendix L shows that …

Scaling up vision-language pre-training for image captioning

X Hu, Z Gan, J Wang, Z Yang, Z Liu… - … on computer vision …, 2022 - openaccess.thecvf.com
… We remove the alt-text if any of its unigrams cannot be found in the vocabulary. Afterwards,
… 200 million images, each corresponding to one alt-text. The word cloud of 200 most frequent …

Scene text detection and recognition: The deep learning era

S Long, X He, C Yao - International Journal of Computer Vision, 2021 - Springer
… With the rise and development of deep learning, computer vision has been tremendously
transformed and reshaped. As an important research area in computer vision, scene text

Adversarial representation learning for text-to-image matching

N Sarafianos, X Xu… - … on computer vision, 2019 - openaccess.thecvf.com
… For many computer vision applications such as image captioning, … and text level is an
essential yet challenging problem. Its challenges originate from the large word variance in the text

Text2LIVE: Text-driven layered image and video editing

O Bar-Tal, D Ofri-Amar, R Fridman, Y Kasten… - … on computer vision, 2022 - Springer
We present a method for zero-shot, text-driven editing of natural images and videos. Given
an image or a video and a text prompt, our goal is to edit the appearance of existing objects (…

Improving vision-and-language navigation with image-text pairs from the web

A Majumdar, A Shrivastava, S Lee, P Anderson… - Computer Vision–ECCV …, 2020 - Springer
… As an alternative, we propose learning visual grounding from freely-available internet data,
alt-text captured in the Conceptual Captions dataset [24], containing around 3.3M image-text

Florence: A new foundation model for computer vision

L Yuan, D Chen, YL Chen, N Codella, X Dai… - arXiv preprint arXiv …, 2021 - arxiv.org
… shared representation, we introduce a new computer vision foundation model, Florence, to
… image-text data, our Florence model can be easily adapted for various computer vision tasks…