Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

TransFuser: Imitation with transformer-based sensor fusion for autonomous driving

K Chitta, A Prakash, B Jaeger, Z Yu… - … on Pattern Analysis …, 2022 - ieeexplore.ieee.org
How should we integrate representations from complementary sensors for autonomous
driving? Geometry-based fusion has shown promise for perception (e.g., object detection …

Vision-language pre-training with triple contrastive learning

J Yang, J Duan, S Tran, Y Xu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …

Multi-modal fusion transformer for end-to-end autonomous driving

A Prakash, K Chitta, A Geiger - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
How should representations from complementary sensors be integrated for autonomous
driving? Geometry-based sensor fusion has shown great promise for perception tasks such …

Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy… - International …, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

X Dong, J Bao, Y Zheng, T Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …

Pre-trained models: Past, present and future

X Han, Z Zhang, N Ding, Y Gu, X Liu, Y Huo, J Qiu… - AI Open, 2021 - Elsevier
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …