Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted with automatic speech recognition systems …
BA Yilma, LA Leiva - arXiv preprint arXiv:2407.21758, 2024 - arxiv.org
Visual art (VA) recommendation is complex, as it has to consider the interests of users (e.g., museum visitors) and other stakeholders (e.g., museum curators). We study how to effectively …
MA Arefeen, B Debnath, MYS Uddin… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval-augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots …
H Truchan, E Naumov, R Abedin, G Palmer… - … Conference on Neural …, 2023 - Springer
Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT), as it enables handling larger image sizes and …
Conventionally, evaluation for the diagnosis of autism spectrum disorder is done by a trained specialist through questionnaire-based formal assessments and by observation of …
If you ask a human to describe an image, they might do so in a thousand different ways. Each of these descriptions depends not only on the image but also on a rich tapestry of …
A Kukleva - 2024 - publikationen.sulb.uni-saarland.de
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic …
Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized …