Multi-sentence Grounding for Long-Term Instructional Video

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - European Conference on …, 2024 - Springer
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-
scale instructional dataset and construct a high-quality video-text dataset with multiple …

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Y Liu, Z Ma, Z Qi, Y Wu, Y Shan, CW Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …

Leveraging large models for crafting narrative visualization: a survey

Y He, S Cao, Y Shi, Q Chen, K Xu, N Cao - arXiv preprint arXiv …, 2024 - arxiv.org
Narrative visualization effectively transforms data into engaging stories, making complex
information accessible to a broad audience. Large models, essential for narrative …

AutoAD III: The Prequel - Back to the Pixels

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Generating Audio Description (AD) for movies is a challenging task that requires
fine-grained visual understanding and an awareness of the characters and their names …

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Y Chen, K Li, W Bao, D Patel, Y Kong, MR Min… - … on Computer Vision, 2024 - Springer
Learning to localize temporal boundaries of procedure steps in instructional videos is
challenging due to the limited availability of annotated large-scale training videos. Recent …

VidLA: Video-Language Alignment at Scale

MN Rizve, F Fei, J Unnikrishnan… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper we propose VidLA an approach for video-language alignment at scale. There
are two major limitations of previous video-language alignment approaches. First they do …

A Strong Baseline for Temporal Video-Text Alignment

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we consider the problem of temporally aligning the video and texts from
instructional videos, specifically, given a long-term video, and associated text sentences, our …

Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review

EMCL Ekanayake, AS Gezawa… - … MATERIALS & CONTINUA, 2024 - cdn.techscience.cn
Video description generates natural language sentences that describe the subject, verb, and
objects of the targeted video. Video description has been used to help visually impaired …

Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators

CW Jo, M Wesołowska, M Wojcieszak - arXiv preprint arXiv:2411.05854, 2024 - arxiv.org
Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users
globally. These platforms expose users to harmful content, ranging from clickbait or physical …

Multi-Modal Inductive Framework for Text-Video Retrieval

Q Li, Y Zhou, C Ji, F Lu, J Gong, S Wang… - Proceedings of the 32nd …, 2024 - dl.acm.org
Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing
methods are limited by their ability to understand and connect different modalities, resulting …