Previous work on multimodal sentence embedding has proposed multimodal contrastive learning and achieved promising results. However, by taking the rest of the batch as …
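For context on the in-batch negative setup this snippet refers to, below is a minimal, generic sketch of an InfoNCE-style contrastive loss in which every non-paired item in the batch serves as a negative. This is the standard formulation, not the cited paper's specific method; the function and variable names are illustrative only.

import numpy as np

def info_nce_in_batch(text_emb, image_emb, temperature=0.07):
    """Generic InfoNCE with in-batch negatives: for each text embedding,
    the paired image embedding is the positive and every other image in
    the batch acts as a negative. Embeddings are L2-normalized first."""
    # Normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature                 # (B, B) similarity matrix
    # Row-wise softmax cross-entropy with the diagonal as the target class.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 4 paired text/image embeddings of dimension 8.
rng = np.random.default_rng(0)
loss = info_nce_in_batch(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(f"in-batch InfoNCE loss: {loss:.4f}")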
Recent representation learning approaches enhance neural topic models by optimizing the weighted linear combination of the evidence lower bound (ELBO) of the log-likelihood and …
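For reference, the weighted combination described above is usually written as the ELBO of the log-likelihood plus a second term whose identity is cut off in this snippet; a generic hedged form, with illustrative weights lambda_1, lambda_2 and a placeholder auxiliary term R(x) that is not taken from the cited work, is:

\mathcal{L}(x) = \lambda_1 \Big( \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \Big) + \lambda_2 \, \mathcal{R}(x)

Here the bracketed part is the standard ELBO for a neural topic model with encoder q_\phi and decoder p_\theta.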
R Wang, Q Yang, S Tian, L Yu, X He, B Wang - Neurocomputing, 2025 - Elsevier
Multimodal Sentiment Analysis (MSA) aims to recognize and understand a speaker's sentiment state by integrating information from natural language, facial …
Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video …
To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to …
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and …
Counterfactual statements, which describe events that did not or cannot take place, are beneficial to numerous NLP applications. Hence, we consider the problem of counterfactual …
Previous research on multimodal entity linking (MEL) has primarily employed contrastive learning as the training objective. However, using the rest of the batch as negative samples …
Y Zhang, Y Liu, C Cheng - 2024 IEEE International Conference …, 2024 - ieeexplore.ieee.org
Multimodal emotion recognition (MER) has been significantly improved by integrating features from various modalities. However, imbalances and heterogeneity among modalities often …