Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Valor: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2024 - Springer
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

Valor: Vision-audio-language omni-perception pretraining model and dataset

J Liu, S Chen, X He, L Guo, X Zhu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multimodal understanding and generation. Unlike widely-studied vision …

Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review

EMCL Ekanayake, AS Gezawa… - … MATERIALS & CONTINUA, 2024 - cdn.techscience.cn
Video description generates natural language sentences that describe the subject, verb, and
objects of the targeted Video. The video description has been used to help visually impaired …

GPT-based knowledge guiding network for commonsense video captioning

M Yuan, G Jia, BK Bao - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Video-based commonsense captioning aims to generate captions for the video content
while providing multiple commonsense about the underlying event. Existing methods utilize …

Multimodal early fusion operators for temporal video scene segmentation tasks

AAR Beserra, R Goularte - Multimedia Tools and Applications, 2023 - Springer
The Temporal Video Scene Segmentation (TVSS) task is still an open problem
presenting challenges in the Multimedia Analysis area. Current approaches employ …

Diffusion-Based Multimodal Video Captioning

J Kainulainen, Z Guo… - Proceedings of the Asian …, 2024 - openaccess.thecvf.com
Diffusion-based models have recently demonstrated notable success in various generative
tasks involving continuous signals, such as image, video, and audio synthesis. However …