In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents …
B Huang, X Wang, H Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) have shown remarkable text understanding capabilities which have been extended as Video LLMs to handle video data for comprehending visual …
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale …
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior works for deepfake detection have made many efforts to capture the intra-modal …
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often …
H Alwassel, S Giancola… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are …