Image captioning with semantic attention

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 162 · Related articles · All 7 versions

[PDF] sciencedirect.com

Recent advances and clinical applications of deep learning in medical image analysis

X Chen, X Wang, K Zhang, KM Fung, TC Thai… - Medical image …, 2022 - Elsevier

Deep learning has received extensive research interest in developing new medical image
processing algorithms, and deep learning based models have been remarkably successful …

Cited by 483 · Related articles · All 9 versions

[PDF] arxiv.org

Tip-Adapter: Training-free adaption of CLIP for few-shot classification

R Zhang, W Zhang, R Fang, P Gao, K Li, J Dai… - European conference on …, 2022 - Springer

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text pairs. It shows …

Cited by 233 · Related articles · All 6 versions

[PDF] openreview.net

Perceiver IO: A general architecture for structured inputs & outputs

A Jaegle, S Borgeaud, JB Alayrac, C Doersch… - arXiv preprint arXiv …, 2021 - arxiv.org

A central goal of machine learning is the development of systems that can solve many
problems in as many data domains as possible. Current architectures, however, cannot be …

Cited by 549 · Related articles · All 4 versions

[PDF] arxiv.org

Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling

R Zhang, R Fang, W Zhang, P Gao, K Li, J Dai… - arXiv preprint arXiv …, 2021 - arxiv.org

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for
learning visual representations by using large-scale contrastive image-text pairs. It shows …

Cited by 325 · Related articles · All 2 versions

[PDF] thecvf.com

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com

Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

Cited by 938 · Related articles · All 12 versions

[PDF] thecvf.com

4D-fy: Text-to-4D generation using hybrid score distillation sampling

S Bahmani, I Skorokhodov, V Rong… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video
models to generate dynamic 3D scenes. However, current text-to-4D methods face a …

Cited by 43 · Related articles · All 7 versions

[PDF] arxiv.org

ChatCAD: Interactive computer-aided diagnosis on medical image using large language models

S Wang, Z Zhao, X Ouyang, Q Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models (LLMs) have recently demonstrated their potential in clinical
applications, providing valuable medical knowledge and advice. For example, a large dialog …

Cited by 154 · Related articles · All 3 versions

[PDF] arxiv.org

A comprehensive survey on community detection with deep learning

X Su, S Xue, F Liu, J Wu, J Yang, C Zhou… - … on Neural Networks …, 2022 - ieeexplore.ieee.org

Detecting a community in a network is a matter of discerning the distinct features and
connections of a group of members that are different from those in other communities. The …

Cited by 373 · Related articles · All 12 versions

[PDF] arxiv.org

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Cited by 330 · Related articles · All 11 versions
