From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning

K Srinivasan, K Raman, J Chen, M Bendersky… - Proceedings of the 44th …, 2021 - dl.acm.org
The milestone improvements brought about by deep representation learning and pre-
training techniques have led to large performance gains across downstream NLP, IR and …

AutoML: A systematic review on automated machine learning with neural architecture search

I Salehin, MS Islam, P Saha, SM Noman, A Tuni… - Journal of Information …, 2024 - Elsevier
AutoML (Automated Machine Learning) is an emerging field that aims to automate
the process of building machine learning models. AutoML emerged to increase productivity …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

X Wang, J Wu, J Chen, L Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …

AI-empowered speed extraction via port-like videos for vehicular trajectory analysis

X Chen, Z Wang, Q Hua, WL Shang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
The automated container terminal (ACT) is considered the development direction of the port industry,
and accurate kinematic data (speed, volume, etc.) are essential for enhancing ACT operation …

Visually grounded reasoning across languages and cultures

F Liu, E Bugliarello, EM Ponti, S Reddy… - arXiv preprint arXiv …, 2021 - arxiv.org
The design of widespread vision-and-language datasets and pre-trained encoders directly
adopts, or draws inspiration from, the concepts and images of ImageNet. While one can …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

X2-VLM: All-in-one pre-trained model for vision-language tasks

Y Zeng, X Zhang, H Li, J Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Vision language pre-training aims to learn alignments between vision and language from a
large amount of data. Most existing methods only learn image-text alignments. Some others …

IGLUE: A benchmark for transfer learning across modalities, tasks, and languages

E Bugliarello, F Liu, J Pfeiffer, S Reddy… - International …, 2022 - proceedings.mlr.press
Reliable evaluation benchmarks designed for replicability and comprehensiveness have
driven progress in machine learning. Due to the lack of a multilingual benchmark, however …