Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

T Nguyen, Y Bin, X Wu, X Dong, Z Hu, K Le… - … on Computer Vision, 2024 - Springer
Data quality stands at the forefront of deciding the effectiveness of video-language
representation learning. However, video-text pairs in previous data typically do not align …

KDMCSE: Knowledge distillation multimodal sentence embeddings with adaptive angular margin contrastive learning

CD Nguyen, T Nguyen, X Wu, AT Luu - arXiv preprint arXiv:2403.17486, 2024 - arxiv.org
Previous work on multimodal sentence embedding has proposed multimodal contrastive
learning and achieved promising results. However, by taking the rest of the batch as …

Topic Modeling as Multi-Objective Contrastive Optimization

T Nguyen, X Wu, X Dong, CDT Nguyen, SK Ng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent representation learning approaches enhance neural topic models by optimizing the
weighted linear combination of the evidence lower bound (ELBO) of the log-likelihood and …

Transformer-based correlation mining network with self-supervised label generation for multimodal sentiment analysis

R Wang, Q Yang, S Tian, L Yu, X He, B Wang - Neurocomputing, 2025 - Elsevier
Multimodal Sentiment Analysis (MSA) aims to recognize and understand a
speaker's sentiment state by integrating information from natural language, facial …

Multi-Scale Contrastive Learning for Video Temporal Grounding

TT Nguyen, Y Bin, X Wu, Z Hu, CDT Nguyen… - arXiv preprint arXiv …, 2024 - arxiv.org
Temporal grounding, which localizes video moments related to a natural language query, is
a core problem of vision-language learning and video understanding. To encode video …

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

TT Nguyen, X Wu, Y Bin, CDT Nguyen, SK Ng… - arXiv preprint arXiv …, 2024 - arxiv.org
To equip artificial intelligence with a comprehensive understanding towards a temporal
world, video and 4D panoptic scene graph generation abstracts visual data into nodes to …

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

T Nguyen, Y Bin, J Xiao, L Qu, Y Li, JZ Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans use multiple senses to comprehend the environment. Vision and language are two
of the most vital senses since they allow us to easily communicate our thoughts and …

Topic-Aware Causal Intervention for Counterfactual Detection

T Nguyen, TM Nguyen - arXiv preprint arXiv:2409.16668, 2024 - arxiv.org
Counterfactual statements, which describe events that did not or cannot take place, are
beneficial to numerous NLP applications. Hence, we consider the problem of counterfactual …

Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation

CD Nguyen, X Wu, T Nguyen, S Zhao, K Le… - arXiv preprint arXiv …, 2025 - arxiv.org
Previous research on multimodal entity linking (MEL) has primarily employed contrastive
learning as the primary objective. However, using the rest of the batch as negative samples …

PSACF: Parallel-Serial Attention-Based Cross-Fusion for Multimodal Emotion Recognition

Y Zhang, Y Liu, C Cheng - 2024 IEEE International Conference …, 2024 - ieeexplore.ieee.org
Multimodal emotion recognition (MER) has significantly improved by integrating features
from various modalities. However, imbalances and heterogeneity among modalities often …