Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

UniVTG: Towards unified video-language temporal grounding

KQ Lin, P Zhang, J Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language queries (e.g., …

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

A survey on generative AI and LLMs for video generation, understanding, and streaming

P Zhou, L Wang, Z Liu, Y Hao, P Hui, S Tarkoma… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …

VideoLLM: Modeling video sequences with large language models

G Chen, YD Zheng, J Wang, J Xu, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential growth of video data, there is an urgent need for automated technology
to analyze and comprehend video content. However, existing video understanding models …

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2025 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …