CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

VisualGPT: Data-efficient adaptation of pretrained language models for image captioning

J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The limited availability of annotated data often hinders real-world applications of machine
learning. To efficiently learn from small quantities of multimodal data, we leverage the …

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

Improving multimodal datasets with image captioning

T Nguyen, SY Gadre, G Ilharco… - Advances in Neural …, 2024 - proceedings.neurips.cc
Massive web datasets play a key role in the success of large vision-language models like
CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to …

A picture is worth more than 77 text tokens: Evaluating CLIP-style models on dense captions

J Urbanek, F Bordes, P Astolfi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Curation methods for massive vision-language datasets trade off between dataset size and
quality. However, even the highest-quality available curated captions are far too short to …

CLAIR: Evaluating image captions with large language models

D Chan, S Petryk, JE Gonzalez, T Darrell… - arXiv preprint arXiv …, 2023 - arxiv.org
The evaluation of machine-generated image captions poses an interesting yet persistent
challenge. Effective evaluation measures must consider numerous dimensions of similarity …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Robust cross-modal representation learning with progressive self-distillation

A Andonian, S Chen, R Hamid - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The learning objective of CLIP's vision-language approach does not effectively account for
the noisy many-to-many correspondences found in web-harvested image captioning …

From scarcity to efficiency: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, W Wu, H Bai, A Timofeev, X Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Web-crawled datasets are pivotal to the success of pre-training vision-language models,
exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …

SmallCap: Lightweight image captioning prompted with retrieval augmentation

R Ramos, B Martins, D Elliott… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in image captioning have focused on scaling the data and model size,
substantially increasing the cost of pre-training and finetuning. As an alternative to large …