A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org
Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan, T Li - Neurocomputing, 2022 - Elsevier
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are the fundamental research problem for multimodal understanding and generation. The …

CLIP4Caption: CLIP for video caption

M Tang, Z Wang, Z Liu, F Rao, D Li, X Li - Proceedings of the 29th ACM …, 2021 - dl.acm.org
Video captioning is a challenging task since it requires generating sentences describing
various diverse and complex videos. Existing video captioning models lack adequate visual …

DeCap: Decoding CLIP latents for zero-shot captioning via text-only training

W Li, L Zhu, L Wen, Y Yang - arXiv preprint arXiv:2303.03032, 2023 - arxiv.org
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot
transfer capability in many discriminative tasks. Their adaptation to zero-shot image …

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

M^4I: Multi-modal Models Membership Inference

P Hu, Z Wang, R Sun, H Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
With the development of machine learning techniques, the attention of research has been
moved from single-modal learning to multi-modal learning, as real-world data exist in the …

Zero-shot video captioning with evolving pseudo-tokens

Y Tewel, Y Shalev, R Nadler, I Schwartz… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce a zero-shot video captioning method that employs two frozen networks: the
GPT-2 language model and the CLIP image-text matching model. The matching score is …

Exploring video captioning techniques: A comprehensive survey on deep learning methods

S Islam, A Dash, A Seum, AH Raj, T Hossain… - SN Computer …, 2021 - Springer
Video captioning is an automated collection of natural language phrases that explains the
contents in video frames. Because of the incomparable performance of deep learning in the …

O2NA: An object-oriented non-autoregressive approach for controllable video captioning

F Liu, X Ren, X Wu, B Yang, S Ge, Y Zou… - arXiv preprint arXiv …, 2021 - arxiv.org
Video captioning combines video understanding and language generation. Different from
image captioning that describes a static image with details of almost every object, video …