A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org
Deep learning has enabled a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Recent advances and trends in multimodal deep learning: A review

J Summaira, X Li, AM Shoib, S Li, J Abdul - arXiv preprint arXiv …, 2021 - arxiv.org
Deep learning has enabled a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning is to create models that can …

Generating diverse and natural 3d human motions from text

C Guo, S Zou, X Zuo, S Wang, W Ji… - Proceedings of the …, 2022 - openaccess.thecvf.com
Automated generation of 3D human motions from text is a challenging problem. The
generated motions are expected to be sufficiently diverse to explore the text-grounded …
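
The diversity requirement is usually handled by sampling: a text-conditioned latent variable is drawn repeatedly and decoded into different motion sequences for the same sentence. Below is a minimal, generic conditional-VAE-style sketch of that sampling step; the module and its names (e.g. ToyTextToMotion) are hypothetical illustrations, not the paper's model.

```python
# Generic conditional-VAE-style sketch: different latent draws decode to
# different motions for the same text. Not Guo et al.'s actual architecture.
import torch
import torch.nn as nn

class ToyTextToMotion(nn.Module):
    def __init__(self, text_dim=512, latent_dim=128, pose_dim=66, seq_len=60):
        super().__init__()
        self.prior = nn.Linear(text_dim, 2 * latent_dim)          # text -> (mu, log_var)
        self.decoder = nn.GRU(latent_dim, 256, batch_first=True)
        self.to_pose = nn.Linear(256, pose_dim)
        self.seq_len = seq_len

    def sample(self, text_emb):                                   # text_emb: (B, text_dim)
        mu, log_var = self.prior(text_emb).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()     # reparameterised draw
        z_seq = z.unsqueeze(1).repeat(1, self.seq_len, 1)         # feed z at every time step
        hidden, _ = self.decoder(z_seq)
        return self.to_pose(hidden)                               # (B, seq_len, pose_dim) poses

model = ToyTextToMotion()
text = torch.randn(1, 512)                                        # placeholder sentence embedding
motions = [model.sample(text) for _ in range(3)]                  # same text, three distinct samples
```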

Swinbert: End-to-end transformers with sparse attention for video captioning

K Lin, L Li, CC Lin, F Ahmed, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
The canonical approach to video captioning dictates that a caption generation model learn
from offline-extracted dense video features. These feature extractors usually operate on …
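
The sparse-attention idea can be pictured as a learnable attention mask over the dense video tokens, pushed toward zero by a sparsity penalty. The sketch below is only a schematic of that idea with invented names, not SwinBERT's implementation.

```python
# Schematic of a learnable sparse attention mask with an L1 sparsity penalty.
import torch
import torch.nn as nn

class LearnableSparseAttention(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # Learnable token-to-token mask logits over the dense video tokens.
        self.mask_logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.scale = dim ** -0.5

    def forward(self, x):                          # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale
        mask = torch.sigmoid(self.mask_logits)     # values in (0, 1)
        attn = scores.softmax(dim=-1) * mask
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        sparsity_loss = mask.abs().mean()          # drives most mask entries toward 0
        return attn @ v, sparsity_loss

x = torch.randn(2, 32, 256)                        # 2 clips, 32 video tokens
out, sparsity = LearnableSparseAttention(256, 32)(x)
```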

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
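
Parallel decoding replaces the localize-then-describe pipeline with a fixed set of learnable event queries decoded simultaneously into segment proposals and captions, DETR-style. The following minimal sketch shows that decoding pattern only; module and head names are hypothetical, not the paper's API.

```python
# DETR-style parallel decoding sketch: every event query yields one
# (segment, caption) proposal in a single forward pass.
import torch
import torch.nn as nn

class ParallelEventDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=10, num_layers=2, vocab_size=10000):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.segment_head = nn.Linear(d_model, 2)        # (center, length) per event
        self.word_head = nn.Linear(d_model, vocab_size)  # toy caption head

    def forward(self, frame_feats):                      # frame_feats: (B, T, d_model)
        B = frame_feats.size(0)
        queries = self.event_queries.unsqueeze(0).repeat(B, 1, 1)
        decoded = self.decoder(queries, frame_feats)     # all events decoded in parallel
        return self.segment_head(decoded).sigmoid(), self.word_head(decoded)

feats = torch.randn(2, 64, 256)                          # 2 clips, 64 frame features
segments, word_logits = ParallelEventDecoder()(feats)
```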

Object relational graph with teacher-recommended learning for video captioning

Z Zhang, Y Shi, C Yuan, B Li, P Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com
Taking full advantage of the information from both vision and language is critical for the
video captioning task. Existing models lack adequate visual representation due to the …
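
The relational-graph idea amounts to message passing over detected-object features, with a soft adjacency derived from pairwise similarity. The sketch below shows only that generic pattern; it is not the paper's ORG formulation and omits the teacher-recommended learning component.

```python
# Generic relational message passing over detected-object features.
import torch
import torch.nn as nn

class ObjectRelationLayer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.relation = nn.Linear(dim, dim)    # scores pairwise relations in an embedded space
        self.update = nn.Linear(dim, dim)

    def forward(self, obj_feats):              # obj_feats: (B, num_objects, dim)
        # Soft adjacency from pairwise similarity of embedded object features.
        rel = self.relation(obj_feats)
        adj = torch.softmax(rel @ obj_feats.transpose(-2, -1), dim=-1)
        # Each object aggregates its neighbours' features, then is re-projected.
        return torch.relu(self.update(adj @ obj_feats)) + obj_feats

objs = torch.randn(2, 20, 512)                 # 2 videos, 20 detected objects each
refined = ObjectRelationLayer()(objs)
```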

AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for videos is considered one of the most complex tasks
in computer vision and natural language understanding due to the intricate nature of video …
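
The atrous-pyramid component can be approximated by parallel 1D temporal convolutions with increasing dilation rates, fused back to the original width. This is a simplified illustration that omits the attention and memory-transformer parts; the layer names are invented for the example.

```python
# Simplified temporal atrous (dilated) pyramid over frame features.
import torch
import torch.nn as nn

class TemporalAtrousPyramid(nn.Module):
    def __init__(self, dim=512, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel branches with growing dilation capture short- and long-range context.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        )
        self.fuse = nn.Conv1d(dim * len(dilations), dim, kernel_size=1)

    def forward(self, frame_feats):                  # frame_feats: (B, T, dim)
        x = frame_feats.transpose(1, 2)              # -> (B, dim, T) for Conv1d
        multi_scale = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.fuse(multi_scale).transpose(1, 2)

feats = torch.randn(2, 64, 512)
context = TemporalAtrousPyramid()(feats)             # (2, 64, 512)
```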

Spatio-temporal graph for video captioning with knowledge distillation

B Pan, H Cai, DA Huang, KH Lee… - Proceedings of the …, 2020 - openaccess.thecvf.com
Video captioning is a challenging task that requires a deep understanding of visual scenes.
State-of-the-art methods generate captions using either scene-level or object-level …
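
Knowledge distillation in this setting generally means training a compact captioning branch to match a richer (e.g. object-level) teacher. A standard soft-target distillation loss, shown below as a generic example rather than the paper's exact scheme, captures the basic mechanism.

```python
# Generic soft-target knowledge-distillation loss (not Pan et al.'s exact scheme).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a magnitude comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: vocabulary-sized logits for a batch of caption tokens.
student = torch.randn(8, 10000)
teacher = torch.randn(8, 10000)
loss = distillation_loss(student, teacher)
```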

Contrastive attention for automatic chest x-ray report generation

F Liu, C Yin, X Wu, S Ge, Y Zou, P Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, chest X-ray report generation, which aims to automatically generate descriptions
of given chest X-ray images, has received growing research interest. The key challenge of …
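
Contrastive attention here refers to comparing the input X-ray against a pool of normal reference images so that the report focuses on abnormal findings. A loose sketch of that compare-and-subtract step follows; the tensor names and dimensions are hypothetical, not the paper's.

```python
# Loose sketch: regions of the input image attend over "normal" reference features,
# and the aggregated normal representation is subtracted to emphasise abnormal content.
import torch
import torch.nn.functional as F

def contrastive_attention(image_feats, normal_pool):
    """image_feats: (N, d) regions of the input image; normal_pool: (M, d) normal references."""
    sim = F.normalize(image_feats, dim=-1) @ F.normalize(normal_pool, dim=-1).T  # (N, M)
    attn = sim.softmax(dim=-1)
    closest_normal = attn @ normal_pool     # per-region aggregation of similar normal features
    return image_feats - closest_normal     # residual highlights what differs from normal

regions = torch.randn(49, 1024)             # e.g. 7x7 grid of CNN features
normals = torch.randn(200, 1024)            # feature pool built from normal chest X-rays
abnormal_cues = contrastive_attention(regions, normals)
```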

Semantic grouping network for video captioning

H Ryu, S Kang, H Kang, CD Yoo - … of the AAAI Conference on Artificial …, 2021 - ojs.aaai.org
This paper considers a video caption generation network, referred to as the Semantic Grouping
Network (SGN), that attempts (1) to group video frames with discriminating word phrases of …
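
Phrase-guided grouping can be pictured as each partially decoded phrase attending over the frame features to pool its own visual evidence. The snippet below is a schematic of that soft assignment only, not SGN's actual architecture; all names are illustrative.

```python
# Schematic phrase-guided frame grouping: soft assignment of frames to phrases.
import torch
import torch.nn.functional as F

def group_frames_by_phrases(phrase_embs, frame_feats, temperature=0.1):
    """phrase_embs: (P, d); frame_feats: (T, d) -> one pooled visual feature per phrase."""
    sim = F.normalize(phrase_embs, dim=-1) @ F.normalize(frame_feats, dim=-1).T   # (P, T)
    weights = (sim / temperature).softmax(dim=-1)   # soft assignment of frames to phrases
    return weights @ frame_feats                    # (P, d) phrase-aligned visual groups

phrases = torch.randn(5, 512)                       # e.g. phrases decoded so far
frames = torch.randn(32, 512)                       # per-frame video features
grouped = group_frames_by_phrases(phrases, frames)
```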