Improving video captioning with temporal composition of a visual-syntactic embedding

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org

Deep Learning has implemented a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Cited by 98 · Related articles · All 7 versions

A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org

Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …

Cited by 21 · Related articles · All 3 versions

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan, T Li - Neurocomputing, 2022 - Elsevier

Video clip retrieval and captioning tasks play an essential role in multimodal research and
are the fundamental research problem for multimodal understanding and generation. The …

Cited by 555 · Related articles · All 5 versions

CLIP4Caption: CLIP for video caption

M Tang, Z Wang, Z Liu, F Rao, D Li, X Li - Proceedings of the 29th ACM …, 2021 - dl.acm.org

Video captioning is a challenging task since it requires generating sentences describing
various diverse and complex videos. Existing video captioning models lack adequate visual …

Cited by 152 · Related articles · All 4 versions

DeCap: Decoding CLIP latents for zero-shot captioning via text-only training

W Li, L Zhu, L Wen, Y Yang - arXiv preprint arXiv:2303.03032, 2023 - arxiv.org

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot
transfer capability in many discriminative tasks. Their adaptation to zero-shot image …

Cited by 92 · Related articles · All 3 versions

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer

Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

Cited by 25 · Related articles · All 5 versions

M^4I: Multi-modal Models Membership Inference

P Hu, Z Wang, R Sun, H Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc

With the development of machine learning techniques, the attention of research has been
moved from single-modal learning to multi-modal learning, as real-world data exist in the …

Cited by 25 · Related articles · All 6 versions

Zero-shot video captioning with evolving pseudo-tokens

Y Tewel, Y Shalev, R Nadler, I Schwartz… - arXiv preprint arXiv …, 2022 - arxiv.org

We introduce a zero-shot video captioning method that employs two frozen networks: the
GPT-2 language model and the CLIP image-text matching model. The matching score is …

Cited by 32 · Related articles · All 4 versions

Exploring video captioning techniques: A comprehensive survey on deep learning methods

S Islam, A Dash, A Seum, AH Raj, T Hossain… - SN Computer …, 2021 - Springer

Video captioning is an automated collection of natural language phrases that explains the
contents in video frames. Because of the incomparable performance of deep learning in the …

Cited by 40 · Related articles · All 6 versions

O2NA: An object-oriented non-autoregressive approach for controllable video captioning

F Liu, X Ren, X Wu, B Yang, S Ge, Y Zou… - arXiv preprint arXiv …, 2021 - arxiv.org

Video captioning combines video understanding and language generation. Different from
image captioning that describes a static image with details of almost every object, video …

Cited by 42 · Related articles · All 4 versions
