TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple...

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 94 · Related articles · All 3 versions

[PDF] fbk.eu

Findings of the 2021 conference on machine translation (WMT21)

A Farhad, A Arkady, B Magdalena, B Ondřej… - Proceedings of the …, 2021 - cris.fbk.eu

This paper presents the results of the news translation task, the multilingual low-resource
translation for Indo-European languages, the triangular translation task, and the automatic …

Cited by 177 · Related articles · All 19 versions

[PDF] thecvf.com

Mdmmt: Multidomain multimodal transformer for video retrieval

M Dzabraev, M Kalashnikov… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present a new state-of-the-art on the text-to-video retrieval task on MSRVTT and LSMDC
benchmarks where our model outperforms all previous solutions by a large margin …

Cited by 143 · Related articles · All 7 versions

[PDF] arxiv.org

A comprehensive review of the video-to-text problem

J Perez-Martin, B Bustos, SJF Guimaraes… - Artificial Intelligence …, 2022 - Springer

Research in the Vision and Language area encompasses challenging topics that seek to
connect visual and textual information. When the visual information is related to videos, this …

Cited by 17 · Related articles · All 8 versions

[PDF] arxiv.org

Universal vision-language dense retrieval: Learning a unified representation space for multi-modal retrieval

Z Liu, C Xiong, Y Lv, Z Liu, G Yu - arXiv preprint arXiv:2209.00179, 2022 - arxiv.org

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds
a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality …

Cited by 20 · Related articles · All 4 versions

[PDF] arxiv.org

Actionhub: a large-scale action video description dataset for zero-shot action recognition

J Zhou, J Liang, KY Lin, J Yang, WS Zheng - arXiv preprint arXiv …, 2024 - arxiv.org

Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and
class descriptions of seen actions that is transferable to unseen actions. The text queries …

Cited by 4 · Related articles · All 2 versions

[PDF] dcu.ie

A task category space for user-centric comparative multimedia search evaluations

J Lokoč, W Bailer, KU Barthel, C Gurrin, S Heller… - … on multimedia modeling, 2022 - Springer

In the last decade, user-centric video search competitions have facilitated the evolution of
interactive video search systems. So far, these competitions focused on a small number of …

Cited by 24 · Related articles · All 11 versions

[PDF] arxiv.org

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

E Song, W Chai, T Ye, JN Hwang, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 8 · Related articles · All 2 versions

[PDF] arxiv.org

Mdmmt-2: Multidomain multimodal transformer for video retrieval, one more step towards generalization

A Kunitsyn, M Kalashnikov, M Dzabraev… - arXiv preprint arXiv …, 2022 - arxiv.org

In this work we present a new State-of-The-Art on the text-to-video retrieval task on
MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF obtained by a single model. Three different data …

Cited by 15 · Related articles · All 2 versions

[PDF] arxiv.org

MMSys'22 Grand Challenge on AI-based Video Production for Soccer

C Midoglu, SA Hicks, V Thambawita, T Kupka… - arXiv preprint arXiv …, 2022 - arxiv.org

Soccer has a considerable market share of the global sports industry, and the interest in
viewing videos from soccer games continues to grow. In this respect, it is important to …

Cited by 14 · Related articles · All 2 versions
