Dense video captioning: A survey of techniques, datasets and evaluation protocols

I Qasim, A Horsch, D Prasad - ACM Computing Surveys, 2023 - dl.acm.org
Untrimmed videos have interrelated events, dependencies, context, overlapping events,
object-object interactions, domain specificity, and other semantics that are worth highlighting …

Magdra: a multi-modal attention graph network with dynamic routing-by-agreement for multi-label emotion recognition

X Li, J Liu, Y Xie, P Gong, X Zhang, H He - Knowledge-Based Systems, 2024 - Elsevier
Multimodal multi-label emotion recognition (MMER) is a vital yet challenging task in affective
computing. Despite significant progress in previous works, there are three limitations: (i) …

Knowledge Graph Based on Reinforcement Learning: A Survey and New Perspectives

Q Huo, H Fu, C Song, Q Sun, P Xu, K Qu, H Feng… - IEEE …, 2024 - ieeexplore.ieee.org
A knowledge graph is a form of data representation that uses a graph structure to model the
connections between things. The intention of a knowledge graph is to optimize the results …

VT-3DCapsNet: Visual tempos 3D-Capsule network for video-based facial expression recognition

Z Li, J Liu, H Wang, X Zhang, Z Wu, B Han - PLoS ONE, 2024 - journals.plos.org
Facial expression recognition (FER) is a hot topic in computer vision, especially as deep
learning based methods are gaining traction in this field. However, traditional convolutional …

Implicit and explicit commonsense for multi-sentence video captioning

SH Chou, JJ Little, L Sigal - Computer Vision and Image Understanding, 2024 - Elsevier
Existing dense or paragraph video captioning approaches rely on holistic representations of
videos, possibly coupled with learned object/action representations, to condition hierarchical …

Custom CNN-BiLSTM model for video captioning

AR Chougule, SD Chavan - Multimedia Tools and Applications, 2024 - Springer
This paper introduces a video captioning model that integrates spatial and temporal feature
extraction methods to produce comprehensive textual descriptions for videos. The …

Tag-inferring and tag-guided Transformer for image captioning

Y Yi, Y Liang, D Kong, Z Tang, J Peng - IET Computer Vision, 2024 - Wiley Online Library
Image captioning is an important task for understanding images. Recently, many studies
have used tags to build alignments between image information and language information …

Unlocking Cognitive Insights: Leveraging Transfer Learning for Caption Generation in Signed Language Videos

AM Pol, SA Patil - 2024 11th International Conference on …, 2024 - ieeexplore.ieee.org
The research delves into sign language recognition from video, employing a transfer
learning approach. Established models like VGGNets, ResNets, DenseNet, Inception, and …

A Novel Cyber-Threat Awareness Framework based on Spatial-Temporal Transformer Encoder for Maritime Transportation Systems

Q Shi, J Liu, J Zhi, P Gong, Z Wu… - 2023 7th International …, 2023 - ieeexplore.ieee.org
Modern ships have leveraged the development of IoT and AI 2.0 to integrate a vast array of
digital infrastructure and navigation-dependent operating systems, facilitating the …

Review on scene graph generation methods

S NC - Multiagent and Grid Systems, 2024 - content.iospress.com
Scene graph generation is a structured way of representing an image as a graphical
network; it is mostly used to describe a scene's objects and attributes and the …