Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library
Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, which is usually performed using the potential of Deep Learning Methods …

HowToCaption: Prompting LLMs to transform video annotations at scale

N Shvetsova, A Kukleva, X Hong, C Rupprecht… - … on Computer Vision, 2025 - Springer
Instructional videos are a common source for learning text-video or even multimodal
representations by leveraging subtitles extracted with automatic speech recognition systems …

MOSAIC: Multimodal Multistakeholder-aware Visual Art Recommendation

BA Yilma, LA Leiva - arXiv preprint arXiv:2407.21758, 2024 - arxiv.org
Visual art (VA) recommendation is complex, as it has to consider the interests of users (e.g.,
museum visitors) and other stakeholders (e.g., museum curators). We study how to effectively …

iRAG: An Incremental Retrieval Augmented Generation System for Videos

MA Arefeen, B Debnath, MYS Uddin… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval augmented generation (RAG) systems combine the strengths of language
generation and information retrieval to power many real-world applications like chatbots …

Multimodal Isotropic Neural Architecture with Patch Embedding

H Truchan, E Naumov, R Abedin, G Palmer… - … Conference on Neural …, 2023 - Springer
Patch embedding has been a significant advancement in Transformer-based models,
particularly the Vision Transformer (ViT), as it enables handling larger image sizes and …

Introducing SSBD+ Dataset with a Convolutional Pipeline for detecting Self-Stimulatory Behaviours in Children using raw videos

V Lokegaonkar, V Jaisankar, P Deepika… - … Conference on E …, 2023 - ieeexplore.ieee.org
Conventionally, evaluation for the diagnosis of Autism spectrum disorder is done by a
trained specialist through questionnaire-based formal assessments and by observation of …

Understanding, Building, and Evaluating Models for Context Aware Conditional Natural Language Generation

DM Chan - 2024 - search.proquest.com
If you ask a human to describe an image, they might do so in a thousand different ways.
Each of these descriptions depends not only on the image but also on a rich tapestry of …

Advancing image and video recognition with less supervision

A Kukleva - 2024 - publikationen.sulb.uni-saarland.de
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and
enhances quality of life across various domains such as entertainment, learning, automatic …

Multi-stage multi-modal pre-training for automatic speech recognition

Y Jain, D Chan, P Dheram, A Khare… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in machine learning have demonstrated that multi-modal pre-training can
improve automatic speech recognition (ASR) performance compared to randomly initialized …