EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Distribution-balanced loss for multi-label classification in long-tailed datasets

T Wu, Q Huang, Z Liu, Y Wang, D Lin - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
We present a new loss function called Distribution-Balanced Loss for the multi-label
recognition problems that exhibit long-tailed class distributions. Compared to conventional …

Dual encoding for video retrieval by text

J Dong, X Li, C Xu, X Yang, G Yang… - … on Pattern Analysis …, 2021 - ieeexplore.ieee.org
This paper attacks the challenging problem of video retrieval by text. In such a retrieval
paradigm, an end user searches for unlabeled videos by ad-hoc queries described …

HiT: Hierarchical transformer with momentum contrast for video-text retrieval

S Liu, H Fan, S Qian, Y Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video-Text Retrieval has been a hot research topic with the growth of multimedia
data on the internet. Transformer for video-text learning has attracted increasing attention …

MovieNet: A holistic dataset for movie understanding

Q Huang, Y Xiong, A Rao, J Wang, D Lin - Computer Vision–ECCV 2020 …, 2020 - Springer
Recent years have seen remarkable advances in visual understanding. However, how to
understand a story-based long video with artistic styles, e.g., movie, remains challenging. In …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Towards long-form video understanding

CY Wu, P Krahenbuhl - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …

Long movie clip classification with state-space video models

MM Islam, G Bertasius - European Conference on Computer Vision, 2022 - Springer
Most modern video recognition models are designed to operate on short video clips (e.g.,
5–10 s in length). Thus, it is challenging to apply such models to long movie understanding …

Computational media intelligence: Human-centered machine analysis of media

K Somandepalli, T Guha, VR Martinez… - Proceedings of the …, 2021 - ieeexplore.ieee.org
Media is created by humans for humans to tell stories. There exists a natural and imminent
need for creating human-centered media analytics to illuminate the stories being told and to …