Mm21 pre-training for video understanding challenge: Video captioning with pretraining techniques

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc

Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Cited by 99 · Related articles · All 6 versions

[PDF] arxiv.org

Valor: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

Cited by 96 · Related articles · All 4 versions

[PDF] arxiv.org

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2024 - Springer

Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

Cited by 5 · Related articles · All 11 versions

Valor: Vision-audio-language omni-perception pretraining model and dataset

J Liu, S Chen, X He, L Guo, X Zhu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multimodal understanding and generation. Unlike widely-studied vision …

Cited by 4 · Related articles · All 6 versions

[PDF] techscience.cn

[PDF] Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review

EMCL Ekanayake, AS Gezawa… - … MATERIALS & CONTINUA, 2024 - cdn.techscience.cn

Video description generates natural language sentences that describe the subject, verb, and
objects of the targeted Video. The video description has been used to help visually impaired …

GPT-based knowledge guiding network for commonsense video captioning

M Yuan, G Jia, BK Bao - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org

Video-based commonsense captioning aims to generate captions for the video content
while providing multiple commonsense about the underlying event. Existing methods utilize …

Cited by 2 · Related articles

Multimodal early fusion operators for temporal video scene segmentation tasks

AAR Beserra, R Goularte - Multimedia Tools and Applications, 2023 - Springer

The Temporal Video Scene Segmentation (TVSS) task is still an open problem
presenting challenges in the Multimedia Analysis area. Current approaches employ …

Cited by 4 · Related articles · All 4 versions

[PDF] thecvf.com

Diffusion-Based Multimodal Video Captioning

J Kainulainen, Z Guo… - Proceedings of the Asian …, 2024 - openaccess.thecvf.com

Diffusion-based models have recently demonstrated notable success in various generative
tasks involving continuous signals, such as image, video, and audio synthesis. However …
