Video description generation using audio and visual cues

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org

Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Cited by 3400 · Related articles · All 12 versions

Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey

D Sharma, C Dhiman, D Kumar - Expert Systems with Applications, 2023 - Elsevier

Automatic Visual Captioning (AVC) generates syntactically and semantically correct
sentences by describing important objects, attributes, and their relationships with each other …

Cited by 11 · Related articles · All 2 versions

Learn to combine modalities in multimodal deep learning

K Liu, Y Li, N Xu, P Natarajan - arXiv preprint arXiv:1805.11730, 2018 - arxiv.org

Combining complementary information from multiple modalities is intuitively appealing for
improving the performance of learning-based approaches. However, it is challenging to fully …

Cited by 224 · Related articles · All 2 versions

ESResNet: Environmental sound classification based on visual domain models

A Guzhov, F Raue, J Hees… - 2020 25th international …, 2021 - ieeexplore.ieee.org

Environmental Sound Classification (ESC) is an active research area in the audio domain
and has seen a lot of progress in the past years. However, many of the existing approaches …

Cited by 132 · Related articles · All 9 versions

Describing videos using multi-modal fusion

Q Jin, J Chen, S Chen, Y Xiong… - Proceedings of the 24th …, 2016 - dl.acm.org

Describing videos with natural language is one of the ultimate goals of video understanding.
Video records multi-modal information including image, motion, aural, speech and so on …

Cited by 119 · Related articles · All 2 versions

Audio-visual transformer based crowd counting

U Sajid, X Chen, H Sajid, T Kim… - Proceedings of the …, 2021 - openaccess.thecvf.com

Crowd estimation is a very challenging problem. The most recent study tries to exploit
auditory information to aid the visual models, however, the performance is limited due to the …

Cited by 33 · Related articles · All 8 versions

Video captioning with guidance of multimodal latent topics

S Chen, J Chen, Q Jin, A Hauptmann - Proceedings of the 25th ACM …, 2017 - dl.acm.org

The topic diversity of open-domain videos leads to various vocabularies and linguistic
expressions in describing video contents, and therefore, makes the video captioning task …

Cited by 74 · Related articles · All 4 versions

AOMD: An analogy-aware approach to offensive meme detection on social media

L Shang, Y Zhang, Y Zha, Y Chen, C Youn… - Information Processing & …, 2021 - Elsevier

This paper focuses on an important problem of detecting offensive analogy meme on online
social media where the visual content and the texts/captions of the meme together make an …

Cited by 25 · Related articles · All 5 versions

Dense multimodal fusion for hierarchically joint representation

D Hu, C Wang, F Nie, X Li - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org

Multiple modalities can provide more valuable information than single one by describing the
same contents in various ways. Previous methods mainly focus on fusing the shallow …

Cited by 47 · Related articles · All 4 versions

Generating video descriptions with latent topic guidance

S Chen, Q Jin, J Chen… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org

Automatic video description generation (aka video captioning) is one of the ultimate goals
for video understanding. Despite the wide range of applications such as video indexing and …

Cited by 41 · Related articles · All 2 versions
