Multi-sentence Grounding for Long-Term Instructional Video

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - European Conference on …, 2024 - Springer
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-
scale instructional dataset and construct a high-quality video-text dataset with multiple …

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Y Liu, Z Ma, Z Qi, Y Wu, Y Shan, CW Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …

Leveraging large models for crafting narrative visualization: a survey

Y He, S Cao, Y Shi, Q Chen, K Xu, N Cao - arXiv preprint arXiv …, 2024 - arxiv.org
Narrative visualization effectively transforms data into engaging stories, making complex
information accessible to a broad audience. Large models, essential for narrative …

AutoAD III: The Prequel - Back to the Pixels

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Generating Audio Description (AD) for movies is a challenging task that requires
fine-grained visual understanding and an awareness of the characters and their names …

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Y Chen, K Li, W Bao, D Patel, Y Kong, MR Min… - … on Computer Vision, 2024 - Springer
Learning to localize temporal boundaries of procedure steps in instructional videos is
challenging due to the limited availability of annotated large-scale training videos. Recent …

VidLA: Video-Language Alignment at Scale

MN Rizve, F Fei, J Unnikrishnan… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper we propose VidLA an approach for video-language alignment at scale. There
are two major limitations of previous video-language alignment approaches. First they do …

A Strong Baseline for Temporal Video-Text Alignment

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we consider the problem of temporally aligning the video and texts from
instructional videos, specifically, given a long-term video, and associated text sentences, our …

Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review

EMCL Ekanayake, AS Gezawa… - … MATERIALS & CONTINUA, 2024 - cdn.techscience.cn
Video description generates natural language sentences that describe the subject, verb, and
objects of the targeted video. Video description has been used to help visually impaired …

Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators

CW Jo, M Wesołowska, M Wojcieszak - arXiv preprint arXiv:2411.05854, 2024 - arxiv.org
Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users
globally. These platforms expose users to harmful content, ranging from clickbait or physical …

Multi-Modal Inductive Framework for Text-Video Retrieval

Q Li, Y Zhou, C Ji, F Lu, J Gong, S Wang… - Proceedings of the 32nd …, 2024 - dl.acm.org
Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing
methods are limited by their ability to understand and connect different modalities, resulting …