Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis

W Han, H Chen, A Gelbukh, A Zadeh… - Proceedings of the …, 2021 - dl.acm.org
Multimodal sentiment analysis aims to extract and integrate semantic information collected
from multiple modalities to recognize the expressed emotions and sentiment in multimodal …

HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention

S Geng, J Yuan, Y Tian, Y Chen, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The success of large-scale contrastive vision-language pretraining (CLIP) has benefited
both visual recognition and multimodal content understanding. The concise design brings …

Sequence-to-sequence learning with latent neural grammars

Y Kim - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
Sequence-to-sequence learning with neural networks has become the de facto standard for
sequence modeling. This approach typically models the local distribution over the next …

Grounding 'grounding' in NLP

KR Chandu, Y Bisk, AW Black - arXiv preprint arXiv:2106.02192, 2021 - arxiv.org
The NLP community has seen substantial recent interest in grounding to facilitate interaction
between language technologies and the world. However, as a community, we use the term …

MCSE: Multimodal contrastive learning of sentence embeddings

M Zhang, M Mosbach, DI Adelani… - arXiv preprint arXiv …, 2022 - arxiv.org
Learning semantically meaningful sentence embeddings is an open problem in natural
language processing. In this work, we propose a sentence embedding learning approach …

VALHALLA: Visual hallucination for machine translation

Y Li, R Panda, Y Kim, CFR Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Designing better machine translation systems by considering auxiliary inputs such as
images has attracted much attention in recent years. While existing methods show promising …

VLGrammar: Grounded grammar induction of vision and language

Y Hong, Q Li, SC Zhu, S Huang - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Cognitive grammar suggests that the acquisition of language grammar is grounded within
visual structures. While grammar is an essential representation of natural language, it also …

Video-aided unsupervised grammar induction

S Zhang, L Song, L Jin, K Xu, D Yu, J Luo - arXiv preprint arXiv …, 2021 - arxiv.org
We investigate video-aided grammar induction, which learns a constituency parser from
both unlabeled text and its corresponding video. Existing methods of multi-modal grammar …

Unsupervised vision-language grammar induction with shared structure modeling

B Wan, W Han, Z Zheng, T Tuytelaars - Proceedings ICLR 2022, 2022 - lirias.kuleuven.be
We introduce a new task, unsupervised vision-language (VL) grammar induction. Given an
image-caption pair, the goal is to extract a shared hierarchical structure for both image and …

Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training

Z Li, LT Yang, X Nie, BC Ren, X Deng - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Large-scale pre-trained language models have garnered significant attention in recent years
due to their effectiveness in extracting sentence representations. However, most pre-trained …