As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal …
P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner that has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Y Huang, T Lv, L Cui, Y Lu, F Wei - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn …
Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with …
A Ramesh, M Pavlov, G Goh, S Gray… - International …, 2021 - proceedings.mlr.press
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures …
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for …
As the Transformer evolves, pre-trained models have advanced at a breakneck pace in recent years. They have come to dominate mainstream techniques in natural language processing …
Q Dong, C Cao, Y Fu - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Image inpainting has made significant advances in recent years. However, it remains challenging to recover corrupted images with both vivid textures and reasonable structures …