Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

UniVTG: Towards unified video-language temporal grounding

KQ Lin, P Zhang, J Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language queries (e.g., …

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

A survey on generative AI and LLMs for video generation, understanding, and streaming

P Zhou, L Wang, Z Liu, Y Hao, P Hui, S Tarkoma… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …

VideoLLM: Modeling video sequences with large language models

G Chen, YD Zheng, J Wang, J Xu, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential growth of video data, there is an urgent need for automated technology
to analyze and comprehend video content. However, existing video understanding models …

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2025 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …