Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images …
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper …
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and …
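To make that kind of pipeline concrete, here is a minimal sketch of a Cap3D-style loop, assuming pre-rendered views of the object already exist on disk: caption each view with an image-captioning model, pick the candidate an image-text alignment model scores highest, then consolidate across views. The BLIP and CLIP checkpoints and the `renders/view_*.png` paths are illustrative stand-ins, and the final LLM consolidation step is stubbed out rather than taken from the paper.

```python
# Sketch of a multi-view captioning pipeline in the spirit of Cap3D.
# Model choices and file paths are illustrative assumptions, not the
# paper's exact configuration.
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPProcessor, CLIPModel)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def caption_view(image, num_candidates=5):
    """Sample candidate captions for one view; keep the one CLIP scores highest."""
    inputs = blip_proc(image, return_tensors="pt")
    ids = blip.generate(**inputs, do_sample=True,
                        num_return_sequences=num_candidates, max_new_tokens=30)
    candidates = [blip_proc.decode(i, skip_special_tokens=True) for i in ids]
    scores = clip(**clip_proc(text=candidates, images=image,
                              return_tensors="pt", padding=True)).logits_per_image[0]
    return candidates[scores.argmax().item()]

views = [Image.open(f"renders/view_{i}.png") for i in range(8)]  # hypothetical paths
per_view = [caption_view(v) for v in views]
# Placeholder for the consolidation step: Cap3D uses an LLM to merge the
# per-view captions into one object-level description.
final_caption = " ".join(per_view)
```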
Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D …
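A toy sketch of that alignment idea: a trainable 3D encoder is pulled toward frozen image and text embeddings with a symmetric contrastive (InfoNCE) loss. The names (`Encoder3D`, `info_nce`), the 128-dim feature size, and the random placeholder features standing in for a real CLIP backbone are all assumptions for illustration.

```python
# Tri-modal alignment sketch: only the 3D encoder trains; image/text
# features are frozen (here, random placeholders).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of unit-norm embeddings."""
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class Encoder3D(torch.nn.Module):
    """Stand-in point-cloud encoder: per-point MLP + max-pool (PointNet-style)."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3, 256), torch.nn.ReLU(), torch.nn.Linear(256, dim))
    def forward(self, pts):                   # pts: (B, N, 3)
        return F.normalize(self.mlp(pts).max(dim=1).values, dim=-1)

B, D = 16, 128
points = torch.randn(B, 1024, 3)                    # a batch of point clouds
img_emb = F.normalize(torch.randn(B, D), dim=-1)    # frozen image features (placeholder)
txt_emb = F.normalize(torch.randn(B, D), dim=-1)    # frozen text features (placeholder)

shape_emb = Encoder3D(D)(points)
loss = info_nce(shape_emb, img_emb) + info_nce(shape_emb, txt_emb)
loss.backward()                                     # gradients flow only into the 3D encoder
```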
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However …
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …
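In that spirit, full-attention cross-modal fusion can be sketched as below: music and motion frames are embedded into a shared width, concatenated, and run through one transformer so every token attends across both modalities before future motion is decoded. All module names and feature sizes here are illustrative assumptions, not FACT's actual architecture or losses.

```python
# Generic full-attention cross-modal fusion sketch for music-to-dance.
import torch
import torch.nn as nn

class CrossModalMotionModel(nn.Module):
    def __init__(self, music_dim=35, motion_dim=219, d_model=256, seq_len=120):
        super().__init__()
        self.music_proj = nn.Linear(music_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 2 * seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, motion_dim)    # predict future motion frames

    def forward(self, music, motion):
        # music: (B, T, music_dim), motion: (B, T, motion_dim)
        tokens = torch.cat([self.music_proj(music),
                            self.motion_proj(motion)], dim=1) + self.pos
        fused = self.encoder(tokens)                   # full attention over both streams
        return self.head(fused[:, -motion.size(1):])   # decode from the motion half

model = CrossModalMotionModel()
music = torch.randn(2, 120, 35)     # e.g. per-frame audio features
motion = torch.randn(2, 120, 219)   # e.g. per-frame pose parameters
pred = model(music, motion)         # (2, 120, 219) predicted motion
```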
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal …
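The mechanism all of these architectures share is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, which applies unchanged whether the tokens come from one modality or several. A minimal single-head version:

```python
# Scaled dot-product attention: each output position is a weighted mix of
# all value vectors, with weights from query-key similarity.
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask: optional boolean (seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
out = attention(q, k, v)            # (2, 10, 64)
```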
Vision Transformers (ViTs), with the magnificent potential to unravel the information contained within images, have evolved as one of the most contemporary and dominant …
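The step that lets a transformer consume an image at all is patch tokenization: split the image into fixed-size patches, linearly embed each one, prepend a learnable class token, and add position embeddings. A sketch following the common ViT-Base/16 layout (all sizes illustrative):

```python
# ViT-style patch embedding: image -> sequence of patch tokens.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.n = (img_size // patch) ** 2                  # 196 patches
        # A strided conv is the standard trick for "split, flatten, project"
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, 197, dim)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))         # ready for a transformer
```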
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
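Mechanically, such unification typically comes down to one shared transformer whose self-attention mask is switched per task: fully bidirectional for understanding tasks, and a seq2seq pattern for generation, where image tokens attend to each other while text tokens attend only to the image and to earlier text. The helper below is a hypothetical sketch of the two mask patterns, not the paper's code:

```python
# Two attention masks over one [image tokens | text tokens] sequence.
import torch

def vlp_mask(n_img, n_txt, mode):
    """True = attention allowed. mode: 'bidirectional' or 'seq2seq'."""
    n = n_img + n_txt
    if mode == "bidirectional":
        return torch.ones(n, n, dtype=torch.bool)    # understanding: all-to-all
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :n_img] = True                           # every token sees the image tokens
    txt = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))
    mask[n_img:, n_img:] = txt                       # text is causal over text
    return mask

print(vlp_mask(3, 4, "seq2seq").int())  # same weights, different mask per task
```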