Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models have been
proposed, such as bidirectional encoder representations from transformers (BERT) and the vision transformer (ViT) …

From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Pre-trained models: Past, present and future

X Han, Z Zhang, N Ding, Y Gu, X Liu, Y Huo, J Qiu… - AI Open, 2021 - Elsevier
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …

A survey of vision-language pre-trained models

Y Du, Z Liu, J Li, WX Zhao - arXiv preprint arXiv:2202.10936, 2022 - arxiv.org
As the Transformer has evolved, pre-trained models have advanced at a breakneck pace in recent
years. They have come to dominate the mainstream techniques in natural language processing …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-trained models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We present VILLA, the first known effort on large-scale adversarial training for
vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task …
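
Adversarial training in this setting means perturbing in the embedding space rather than the raw pixels or tokens. Below is a minimal PyTorch sketch of that general idea, a single PGD-style perturbation step on the embeddings; `embed_fn`, `head_fn`, and the one-step schedule are illustrative assumptions, not VILLA's actual procedure.

```python
import torch

def adversarial_step(embed_fn, head_fn, inputs, labels, loss_fn, eps=1e-2, alpha=1e-3):
    """One PGD-style perturbation step in embedding space (a sketch, not VILLA's code)."""
    emb = embed_fn(inputs)                              # clean embeddings
    delta = torch.zeros_like(emb, requires_grad=True)   # adversarial perturbation
    loss_fn(head_fn(emb + delta), labels).backward()    # gradient w.r.t. delta
    with torch.no_grad():
        # ascend the loss, then project back into the eps-ball around the clean point
        delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    return loss_fn(head_fn(emb.detach() + delta), labels)  # adversarial loss term
```

In practice this adversarial loss would be combined with the clean-input loss at each training step; the same recipe can be applied in both the pre-training and fine-tuning stages that the abstract mentions.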

HERO: Hierarchical encoder for video+language omni-representation pre-training

L Li, YC Chen, Y Cheng, Z Gan, L Yu, J Liu - arXiv preprint arXiv …, 2020 - arxiv.org
We present HERO, a novel framework for large-scale video+language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …
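
The hierarchy pairs a local encoder, which fuses each frame with its aligned text, with a temporal encoder over the whole clip. A minimal PyTorch sketch of that two-level structure follows; the `HierarchicalEncoder` class, dimensions, and layer counts are illustrative assumptions, not HERO's published configuration.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch of hierarchical video+language encoding: a local encoder fuses
    each frame with its aligned text tokens, then a temporal encoder models
    the whole clip. Sizes are illustrative only."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        local_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.local = nn.TransformerEncoder(local_layer, num_layers=2)      # frame + text
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=2)  # across frames

    def forward(self, frame_feats, text_feats):
        # frame_feats: (batch, n_frames, dim); text_feats: (batch, n_frames, n_tokens, dim)
        b, n, d = frame_feats.shape
        fused = []
        for i in range(n):
            # local context: fuse one frame with its aligned text tokens
            tokens = torch.cat([frame_feats[:, i:i+1], text_feats[:, i]], dim=1)
            fused.append(self.local(tokens)[:, 0])    # keep the frame token
        clip = torch.stack(fused, dim=1)               # (batch, n_frames, dim)
        return self.temporal(clip)                     # global temporal context
```

The design choice the sketch illustrates is that cross-modal alignment happens at the frame level before any clip-level temporal modeling, rather than flattening all frames and tokens into one sequence.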

LIFT: Language-interfaced fine-tuning for non-language machine learning tasks

T Dinh, Y Zeng, R Zhang, Z Lin… - Advances in …, 2022 - proceedings.neurips.cc
Fine-tuning pretrained language models (LMs) without making any architectural changes
has become the norm for learning various downstream language tasks. However, for non …
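
The "language interface" means serializing non-language examples into text so a pretrained LM can be fine-tuned on them with no architectural change. A minimal sketch of that serialization step, assuming a tabular classification task; `row_to_prompt` and its template are hypothetical, not the paper's exact format.

```python
def row_to_prompt(features, feature_names, label=None):
    """Serialize one tabular example into a natural-language prompt, in the
    spirit of language-interfaced fine-tuning. Template is illustrative."""
    parts = [f"{name} is {value}" for name, value in zip(feature_names, features)]
    prompt = "Given that " + ", ".join(parts) + ", what is the class?"
    if label is not None:
        prompt += f" Answer: {label}"   # training target appended for fine-tuning
    return prompt

# e.g. turn an iris-style row into text an LM can be fine-tuned on
print(row_to_prompt([5.1, 3.5, 1.4],
                    ["sepal length", "sepal width", "petal length"],
                    "setosa"))
```

Pairs like these would then be fed to an ordinary LM fine-tuning loop, with prediction done by generating the answer text rather than by a task-specific head.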