Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Video pivoting unsupervised multi-modal machine translation

M Li, PY Huang, X Chang, J Hu, Y Yang… - … on Pattern Analysis …, 2022 - ieeexplore.ieee.org
The main challenge in the field of unsupervised machine translation (UMT) is to associate
source-target sentences in the latent space. As people who speak different languages share …

Support-set bottlenecks for video-text representation learning

M Patrick, PY Huang, Y Asano, F Metze… - arXiv preprint arXiv …, 2020 - arxiv.org
The dominant paradigm for learning video-text representations--noise contrastive learning--
increases the similarity of the representations of pairs of samples that are known to be …

Experience grounds language

Y Bisk, A Holtzman, J Thomason, J Andreas… - arXiv preprint arXiv …, 2020 - arxiv.org
Language understanding research is held back by a failure to relate language to the
physical world it describes and to the social interactions it facilitates. Despite the incredible …

Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis

W Han, H Chen, A Gelbukh, A Zadeh… - Proceedings of the …, 2021 - dl.acm.org
Multimodal sentiment analysis aims to extract and integrate semantic information collected
from multiple modalities to recognize the expressed emotions and sentiment in multimodal …

Deep vision multimodal learning: Methodology, benchmark, and trend

W Chai, G Wang - Applied Sciences, 2022 - mdpi.com
Deep vision multimodal learning aims at combining deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …

Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination

H Fei, Q Liu, M Zhang, M Zhang, TS Chua - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we investigate a more realistic unsupervised multimodal machine translation
(UMMT) setup, inference-time image-free UMMT, where the model is trained with source-text …

IGLUE: A benchmark for transfer learning across modalities, tasks, and languages

E Bugliarello, F Liu, J Pfeiffer, S Reddy… - International …, 2022 - proceedings.mlr.press
Reliable evaluation benchmarks designed for replicability and comprehensiveness have
driven progress in machine learning. Due to the lack of a multilingual benchmark, however …

Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment

S Wu, H Fei, W Ji, TS Chua - arXiv preprint arXiv:2305.12260, 2023 - arxiv.org
Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency
issues, due to the inconsistencies of the semantic scene and syntax attributes during …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …