Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Hierarchical conditional relation networks for video question answering

TM Le, V Le, S Venkatesh… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill
dynamic visual artifacts and distant relations and to associate them with linguistic concepts …

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …

Multi-modal attention network learning for semantic source code retrieval

Y Wan, J Shu, Y Sui, G Xu, Z Zhao… - 2019 34th IEEE/ACM …, 2019 - ieeexplore.ieee.org
Code retrieval techniques and tools have been playing a key role in facilitating software
developers to retrieve existing code fragments from available open-source repositories …

Bridge to answer: Structure-aware graph interaction network for video question answering

J Park, J Lee, K Sohn - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for
questions about a given video by leveraging adequate graph interactions of heterogeneous …

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

J Li, L Niu, L Zhang - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Video understanding has achieved great success in representation learning, such as video
caption, video object grounding, and video descriptive question-answer. However, current …

Triple attention learning for classification of 14 thoracic diseases using chest radiography

H Wang, S Wang, Z Qin, Y Zhang, R Li, Y Xia - Medical Image Analysis, 2021 - Elsevier
Chest X-ray is the most common radiology examination for the diagnosis of thoracic
diseases. However, due to the complexity of pathological abnormalities and lack of detailed …

ExCL: Extractive clip localization using natural language descriptions

S Ghosh, A Agarwal, Z Parekh… - arXiv preprint arXiv …, 2019 - arxiv.org
The task of retrieving clips within videos based on a given natural language query requires
cross-modal reasoning over multiple frames. Prior approaches such as sliding window …

Temporal query networks for fine-grained video understanding

C Zhang, A Gupta, A Zisserman - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Our objective in this work is fine-grained classification of actions in untrimmed videos, where
the actions may be temporally extended or may span only a few frames of the video. We cast …

SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events

L Xu, H Huang, J Liu - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
Traffic event cognition and reasoning in videos is an important task that has a wide range of
applications in intelligent transportation, assisted driving, and autonomous vehicles. In this …