Video description: A survey of methods, datasets, and evaluation metrics

N Aafaq, A Mian, W Liu, SZ Gilani, M Shah - ACM Computing Surveys …, 2019 - dl.acm.org
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Video captioning: a review of theory, techniques and practices

V Jain, F Al-Turjman, G Chaudhary… - Multimedia Tools & …, 2022 - search.ebscohost.com
In today's world, video captioning is extensively used in various applications for
specially-abled and, more specifically, visually abled persons. With advancements in technology for …

Hierarchical conditional relation networks for video question answering

TM Le, V Le, S Venkatesh… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill
dynamic visual artifacts and distant relations and to associate them with linguistic concepts …

Predicting human eye fixations via an LSTM-based saliency attentive model

M Cornia, L Baraldi, G Serra… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Data-driven saliency has recently gained a lot of attention thanks to the use of convolutional
neural networks for predicting gaze fixations. In this paper, we go beyond standard …

Working memory connections for LSTM

F Landi, L Baraldi, M Cornia, R Cucchiara - Neural Networks, 2021 - Elsevier
Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of
gating mechanisms to mitigate exploding and vanishing gradients when learning long-term …

Semantic grouping network for video captioning

H Ryu, S Kang, H Kang, CD Yoo - … of the AAAI Conference on Artificial …, 2021 - ojs.aaai.org
This paper considers a video caption generating network referred to as Semantic Grouping
Network (SGN) that attempts (1) to group video frames with discriminating word phrases of …

Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning

N Aafaq, N Akhtar, W Liu, SZ Gilani… - Proceedings of the …, 2019 - openaccess.thecvf.com
Automatic generation of video captions is a fundamental challenge in computer vision.
Recent techniques typically employ a combination of Convolutional Neural Networks …

Syntax-aware action targeting for video captioning

Q Zheng, C Wang, D Tao - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Existing methods on video captioning have made great efforts to identify objects/instances in
videos, but few of them emphasize the prediction of action. As a result, the learned models …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is a task of localizing interesting events from an untrimmed video
and producing textual description (captions) for each localized event. Most of the previous …