Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated" localize-then-describe" …

Autoad ii: The sequel-who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Vtimellm: Empower llm to grasp video moments

B Huang, X Wang, H Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) have shown remarkable text understanding capabilities
which have been extended as Video LLMs to handle video data for comprehending visual …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Avoid-df: Audio-visual joint learning for detecting deepfake

W Yang, X Zhou, Z Chen, B Guo, Z Ba… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Recently, deepfakes have raised severe concerns about the authenticity of online media.
Prior works for deepfake detection have made many efforts to capture the intra-modal …

Clip-it! language-guided video summarization

M Narasimhan, A Rohrbach… - Advances in neural …, 2021 - proceedings.neurips.cc
A generic video summary is an abridged version of a video that conveys the whole story and
features the most important scenes. Yet the importance of scenes in a video is often …

Tsp: Temporally-sensitive pretraining of video encoders for localization tasks

H Alwassel, S Giancola… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Due to the large memory footprint of untrimmed videos, current state-of-the-art video
localization methods operate atop precomputed video clip features. These features are …