In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents …
B Huang, X Wang, H Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) have shown remarkable text understanding capabilities which have been extended as Video LLMs to handle video data for comprehending visual …
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale …
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior works for deepfake detection have made many efforts to capture the intra-modal …
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often …
H Alwassel, S Giancola… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are …