A survey on multi-modal summarization

A Jangra, S Mukherjee, A Jatowt, S Saha… - ACM Computing …, 2023 - dl.acm.org
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …

Adaptive context-aware multi-modal network for depth completion

S Zhao, M Gong, H Fu, D Tao - IEEE Transactions on Image …, 2021 - ieeexplore.ieee.org
Depth completion aims to recover a dense depth map from sparse depth data and the
corresponding single RGB image. The observed pixels provide significant guidance for …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video
and producing a textual description (caption) for each localized event. Most of the previous …

Cross-modal background suppression for audio-visual event localization

Y Xia, Z Zhao - Proceedings of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …

Towards audio to scene image synthesis using generative adversarial network

CH Wan, SP Chuang, HY Lee - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
Humans can imagine a scene from a sound. We want machines to do so by using
conditional generative adversarial networks (GANs). By applying the techniques including …

Audio-based Active and Assisted Living: A review of selected applications and future trends

V Despotovic, P Pocta, A Zgank - Computers in Biology and Medicine, 2022 - Elsevier
The development of big data, machine learning, and the Internet of Things has led to rapid
advances in the research field of Active and Assisted Living (AAL). A human is placed in the …

Dynamic graph representation learning for video dialog via multi-modal shuffled transformers

S Geng, P Gao, M Chatterjee, C Hori… - Proceedings of the …, 2021 - ojs.aaai.org
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware
dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human …

Large scale audiovisual learning of sounds with weakly labeled data

HM Fayek, A Kumar - arXiv preprint arXiv:2006.01595, 2020 - arxiv.org
Recognizing sounds is a key aspect of computational audio scene analysis and machine
perception. In this paper, we advocate that sound recognition is inherently a multi-modal …

An hybrid cnn-transformer model based on multi-feature extraction and attention fusion mechanism for cerebral emboli classification

Y Vindas, BK Guépié, M Almar… - Machine Learning …, 2022 - proceedings.mlr.press
When dealing with signal processing and deep learning for classification, the choice of
whether to input the raw signal or transform it into a time-frequency representation (TFR) …

MPP-net: multi-perspective perception network for dense video captioning

Y Wei, S Yuan, M Chen, X Shen, L Wang, L Shen… - Neurocomputing, 2023 - Elsevier
Applying the deformable transformer to dense video captioning has achieved great success
recently. However, the deformable transformer only explores local-perspective perception by …