MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Findings of the 2021 conference on machine translation (WMT21)

A Farhad, A Arkady, B Magdalena, B Ondřej… - Proceedings of the …, 2021 - cris.fbk.eu
This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …

MDMMT: Multidomain multimodal transformer for video retrieval

M Dzabraev, M Kalashnikov… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a new state-of-the-art on the text-to-video retrieval task on MSRVTT and LSMDC
benchmarks where our model outperforms all previous solutions by a large margin …

A comprehensive review of the video-to-text problem

J Perez-Martin, B Bustos, SJF Guimaraes… - Artificial Intelligence …, 2022 - Springer
Research in the Vision and Language area encompasses challenging topics that seek to
connect visual and textual information. When the visual information is related to videos, this …

Universal vision-language dense retrieval: Learning a unified representation space for multi-modal retrieval

Z Liu, C Xiong, Y Lv, Z Liu, G Yu - arXiv preprint arXiv:2209.00179, 2022 - arxiv.org
This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds
a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality …

ActionHub: a large-scale action video description dataset for zero-shot action recognition

J Zhou, J Liang, KY Lin, J Yang, WS Zheng - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and
class descriptions of seen actions that is transferable to unseen actions. The text queries …

A task category space for user-centric comparative multimedia search evaluations

J Lokoč, W Bailer, KU Barthel, C Gurrin, S Heller… - … on multimedia modeling, 2022 - Springer
In the last decade, user-centric video search competitions have facilitated the evolution of
interactive video search systems. So far, these competitions focused on a small number of …

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

E Song, W Chai, T Ye, JN Hwang, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

MDMMT-2: Multidomain multimodal transformer for video retrieval, one more step towards generalization

A Kunitsyn, M Kalashnikov, M Dzabraev… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work we present a new State-of-The-Art on the text-to-video retrieval task on
MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF obtained by a single model. Three different data …

MMSys'22 Grand Challenge on AI-based Video Production for Soccer

C Midoglu, SA Hicks, V Thambawita, T Kupka… - arXiv preprint arXiv …, 2022 - arxiv.org
Soccer has a considerable market share of the global sports industry, and the interest in
viewing videos from soccer games continues to grow. In this respect, it is important to …