A survey on multi-modal summarization

A Jangra, S Mukherjee, A Jatowt, S Saha… - ACM Computing …, 2023 - dl.acm.org
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …

Adaptive context-aware multi-modal network for depth completion

S Zhao, M Gong, H Fu, D Tao - IEEE Transactions on Image …, 2021 - ieeexplore.ieee.org
Depth completion aims to recover a dense depth map from sparse depth data and the
corresponding single RGB image. The observed pixels provide significant guidance for …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video
and producing a textual description (caption) for each localized event. Most of the previous …

Cross-modal background suppression for audio-visual event localization

Y Xia, Z Zhao - Proceedings of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …

Towards audio to scene image synthesis using generative adversarial network

CH Wan, SP Chuang, HY Lee - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
Humans can imagine a scene from a sound. We want machines to do so by using
conditional generative adversarial networks (GANs). By applying the techniques including …

Audio-based Active and Assisted Living: A review of selected applications and future trends

V Despotovic, P Pocta, A Zgank - Computers in Biology and Medicine, 2022 - Elsevier
The development of big data, machine learning, and the Internet of Things has led to rapid
advances in the research field of Active and Assisted Living (AAL). A human is placed in the …

Dynamic graph representation learning for video dialog via multi-modal shuffled transformers

S Geng, P Gao, M Chatterjee, C Hori… - Proceedings of the …, 2021 - ojs.aaai.org
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware
dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human …

Large scale audiovisual learning of sounds with weakly labeled data

HM Fayek, A Kumar - arXiv preprint arXiv:2006.01595, 2020 - arxiv.org
Recognizing sounds is a key aspect of computational audio scene analysis and machine
perception. In this paper, we advocate that sound recognition is inherently a multi-modal …

An hybrid cnn-transformer model based on multi-feature extraction and attention fusion mechanism for cerebral emboli classification

Y Vindas, BK Guépié, M Almar… - Machine Learning …, 2022 - proceedings.mlr.press
When dealing with signal processing and deep learning for classification, the choice of
whether to input the raw signal or transform it into a time-frequency representation (TFR) …

MPP-net: multi-perspective perception network for dense video captioning

Y Wei, S Yuan, M Chen, X Shen, L Wang, L Shen… - Neurocomputing, 2023 - Elsevier
Applying the deformable transformer to dense video captioning has achieved great success
recently. However, the deformable transformer only explores local-perspective perception by …