Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Evolution of visual data captioning methods, datasets, and evaluation metrics: A comprehensive survey

D Sharma, C Dhiman, D Kumar - Expert Systems with Applications, 2023 - Elsevier
Automatic Visual Captioning (AVC) generates syntactically and semantically correct
sentences by describing important objects, attributes, and their relationships with each other …

Learn to combine modalities in multimodal deep learning

K Liu, Y Li, N Xu, P Natarajan - arXiv preprint arXiv:1805.11730, 2018 - arxiv.org
Combining complementary information from multiple modalities is intuitively appealing for
improving the performance of learning-based approaches. However, it is challenging to fully …

ESResNet: Environmental sound classification based on visual domain models

A Guzhov, F Raue, J Hees… - 2020 25th international …, 2021 - ieeexplore.ieee.org
Environmental Sound Classification (ESC) is an active research area in the audio domain
and has seen a lot of progress in the past years. However, many of the existing approaches …

Describing videos using multi-modal fusion

Q Jin, J Chen, S Chen, Y Xiong… - Proceedings of the 24th …, 2016 - dl.acm.org
Describing videos with natural language is one of the ultimate goals of video understanding.
Video records multi-modal information including image, motion, aural, speech and so on …

Audio-visual transformer based crowd counting

U Sajid, X Chen, H Sajid, T Kim… - Proceedings of the …, 2021 - openaccess.thecvf.com
Crowd estimation is a very challenging problem. The most recent study tries to exploit
auditory information to aid the visual models, however, the performance is limited due to the …

Video captioning with guidance of multimodal latent topics

S Chen, J Chen, Q Jin, A Hauptmann - Proceedings of the 25th ACM …, 2017 - dl.acm.org
The topic diversity of open-domain videos leads to various vocabularies and linguistic
expressions in describing video contents, and therefore, makes the video captioning task …

AOMD: An analogy-aware approach to offensive meme detection on social media

L Shang, Y Zhang, Y Zha, Y Chen, C Youn… - Information Processing & …, 2021 - Elsevier
This paper focuses on an important problem of detecting offensive analogy memes on online
social media where the visual content and the texts/captions of the meme together make an …

Dense multimodal fusion for hierarchically joint representation

D Hu, C Wang, F Nie, X Li - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
Multiple modalities can provide more valuable information than single one by describing the
same contents in various ways. Previous methods mainly focus on fusing the shallow …

Generating video descriptions with latent topic guidance

S Chen, Q Jin, J Chen… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Automatic video description generation (aka video captioning) is one of the ultimate goals
for video understanding. Despite the wide range of applications such as video indexing and …