Scalable and accurate self-supervised multimodal representation learning without aligned video and text data

V Lialin, S Rawls, D Chan, S Ghosh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text
domain and has contributed to most of the recent state-of-the-art computer vision and …

A comprehensive review on recent methods and challenges of video description

A Singh, TD Singh, S Bandyopadhyay - arXiv preprint arXiv:2011.14752, 2020 - arxiv.org
Video description involves the generation of the natural language description of actions,
events, and objects in the video. There are various applications of video description by filling …

Deep learning based video captioning in Bengali

AH Raj, A Seum, A Dash, S Islam… - 2021 26th International …, 2021 - ieeexplore.ieee.org
Generating meaningful textual descriptions from visual contents having the context in
consideration is very challenging in terms of Natural Language Processing (NLP) and …

Automatic generation of natural language descriptions of visual data: describing images and videos using recurrent and self-attentive models

P Harzig - 2022 - opus.bibliothek.uni-augsburg.de
Humans are faced with a constant flow of visual stimuli, e.g., from the environment or when
looking at social media. In contrast, visually-impaired people are often unable to perceive …

A Comprehensive Review on Recent Methods and Challenges of Video

A Singh, TD Singh… - arXiv preprint arXiv …, 2020 - academia.edu
Authors' address: Alok Singh, alok_rs@cse.nits.ac.in; Thoudam Doren Singh,
doren@cse.nits.ac.in; Sivaji Bandyopadhyay, sivaji.cse.ju@gmail.com, Centre for Natural …