From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each single visual recognition task …

Glipv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

Clip-forge: Towards zero-shot text-to-shape generation

A Sanghi, H Chu, JG Lambourne… - Proceedings of the …, 2022 - openaccess.thecvf.com
Generating shapes using natural language can enable new ways of imagining and creating
the things around us. While significant recent progress has been made in text-to-image …

Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy… - International …, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …
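This snippet motivates learning from natural-language supervision instead of a fixed label set. As a concrete illustration, the sketch below shows zero-shot classification with the publicly released CLIP weights through the Hugging Face transformers API; this tooling choice and the example labels are assumptions for illustration, not the paper's own code. Arbitrary class names become text prompts and are ranked by image-text similarity.

```python
# Illustrative sketch (assumed tooling: Hugging Face transformers), not the
# paper's original code. Class names become text prompts, so the label set
# is not fixed in advance.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities (scaled by a learned temperature).
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```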

Negative-aware attention framework for image-text matching

K Zhang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key to this task is to accurately measure the similarity between these two modalities. Prior …
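Since this snippet hinges on measuring cross-modal similarity, a minimal sketch of the standard baseline follows: project both modalities into a shared embedding space and score pairs by cosine similarity. The encoders and dimensions are hypothetical placeholders; the paper's negative-aware attention is not shown.

```python
# Minimal sketch of the similarity measurement at the core of image-text
# matching. Feature tensors stand in for the outputs of (hypothetical)
# image and text encoders.
import torch
import torch.nn.functional as F

def match_scores(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Return an (N_images x N_texts) cosine-similarity matrix."""
    img = F.normalize(image_feats, dim=-1)  # L2-normalize image embeddings
    txt = F.normalize(text_feats, dim=-1)   # L2-normalize text embeddings
    return img @ txt.t()

# Toy usage with random features standing in for encoder outputs.
image_feats = torch.randn(4, 512)
text_feats = torch.randn(6, 512)
scores = match_scores(image_feats, text_feats)
best_caption_per_image = scores.argmax(dim=1)  # retrieval by highest similarity
```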

Tedigan: Text-guided diverse face image generation and manipulation

W Xia, Y Yang, JH Xue, B Wu - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this work, we propose TediGAN, a novel framework for multi-modal image generation and
manipulation with textual descriptions. The proposed method consists of three components …

Multi-modal transformer for video retrieval

V Gabeur, C Sun, K Alahari, C Schmid - … 28, 2020, Proceedings, Part IV 16, 2020 - Springer
The task of retrieving video content relevant to natural language queries plays a critical role
in effectively handling internet-scale datasets. Most of the existing methods for this caption-to …