Multi-granularity aggregation transformer for joint video-audio-text representation learning

Transformer-based multimodal emotional perception for dynamic facial expression recognition in the wild

X Zhang, M Li, S Lin, H Xu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Dynamic expression recognition in the wild is a challenging task due to various obstacles,
including low light condition, non-positive face, and face occlusion. Purely vision-based …

Cited by 11

Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey

P Sahoo, P Meharia, A Ghosh, S Saha, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …

Hierarchical Multi-modal Attention Network for Time-sync Comment Video Recommendation

W Zhao, H Wu, W He, H Bi, H Wang… - … on Circuits and …, 2023 - ieeexplore.ieee.org

Due to inherent interactivity, time-sync comment of videos have attracted increasing
attention and were widely adopted in online video platforms. In addition to enhancing user …

Cited by 1

Key role guided transformer for group activity recognition

D Pei, D Huang, L Kong, Y Wang - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Group Activity Recognition (GAR) is a challenging task, where modeling spatio-temporal
relationships among participants plays a fundamental role. To address this issue, we …

Cited by 4

Transformer-based relational inference network for complex visual relational reasoning

M Tan, Z Wen, L Fang, Q Wu - ACM Transactions on Multimedia …, 2023 - dl.acm.org

Visual Relational Reasoning is the basis of many vision-and-language based tasks (e.g.,
visual question answering and referring expression comprehension). In this article, we …

Cited by 2

PLGNet: Prior-guided Local and Global Interactive Hybrid Network for Face Super-Resolution

L Li, Y Zhang, L Yuan, X Gao - IEEE Transactions on Circuits …, 2024 - ieeexplore.ieee.org

Recent CNN-driven face super-resolution (FSR) technologies have achieved excellent
breakthroughs by incorporating facial prior knowledge. However, most of them suffer from …

Audio-Infused Automatic Image Colorization by Exploiting Audio Scene Semantics

P Zhao, Y Chen, Y Zhao, W Jia, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Automatic image colorization is inherently an ill-posed problem with uncertainty, which
requires an accurate semantic understanding of scenes to estimate reasonable colors for …
