Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

VIOLET: End-to-end video-language transformers with masked visual-token modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …

Med-UniC: Unifying cross-lingual medical vision-language pre-training by diminishing bias

Z Wan, C Liu, M Zhang, J Fu, B Wang… - Advances in …, 2024 - proceedings.neurips.cc
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-
training (VLP). A potential solution lies in the combination of datasets from various language …

Deep vision multimodal learning: Methodology, benchmark, and trend

W Chai, G Wang - Applied Sciences, 2022 - mdpi.com
Deep vision multimodal learning aims at combining deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …

Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

J Gu, X Meng, G Lu, L Hou, N Minzhe… - Advances in …, 2022 - proceedings.neurips.cc
Abstract Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

Class-agnostic object detection with multi-modal transformer

M Maaz, H Rasheed, S Khan, FS Khan… - European conference on …, 2022 - Springer
What constitutes an object? This has been a long-standing question in computer vision.
Towards this goal, numerous learning-free and learning-based approaches have been …