From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning

K Srinivasan, K Raman, J Chen, M Bendersky… - Proceedings of the 44th …, 2021 - dl.acm.org
The milestone improvements brought about by deep representation learning and pre-
training techniques have led to large performance gains across downstream NLP, IR and …

AutoML: A systematic review on automated machine learning with neural architecture search

I Salehin, MS Islam, P Saha, SM Noman, A Tuni… - Journal of Information …, 2024 - Elsevier
AutoML (Automated Machine Learning) is an emerging field that aims to automate
the process of building machine learning models. AutoML emerged to increase productivity …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

X Wang, J Wu, J Chen, L Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …

AI-empowered speed extraction via port-like videos for vehicular trajectory analysis

X Chen, Z Wang, Q Hua, WL Shang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
The automated container terminal (ACT) is considered the development direction of the port industry,
and accurate kinematic data (speed, volume, etc.) are essential for enhancing ACT operation …

Visually grounded reasoning across languages and cultures

F Liu, E Bugliarello, EM Ponti, S Reddy… - arXiv preprint arXiv …, 2021 - arxiv.org
The design of widespread vision-and-language datasets and pre-trained encoders directly
adopts, or draws inspiration from, the concepts and images of ImageNet. While one can …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

X2-VLM: All-in-one pre-trained model for vision-language tasks

Y Zeng, X Zhang, H Li, J Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Vision language pre-training aims to learn alignments between vision and language from a
large amount of data. Most existing methods only learn image-text alignments. Some others …

IGLUE: A benchmark for transfer learning across modalities, tasks, and languages

E Bugliarello, F Liu, J Pfeiffer, S Reddy… - International …, 2022 - proceedings.mlr.press
Reliable evaluation benchmarks designed for replicability and comprehensiveness have
driven progress in machine learning. Due to the lack of a multilingual benchmark, however …