Multimodal dual attention memory for video story question answering

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Cited by 137 Related articles All 8 versions

[PDF] thecvf.com

Hierarchical conditional relation networks for video question answering

TM Le, V Le, S Venkatesh… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill
dynamic visual artifacts and distant relations and to associate them with linguistic concepts …

Cited by 321 Related articles All 11 versions

[PDF] acm.org

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …

Cited by 34 Related articles

[PDF] arxiv.org

Multi-modal attention network learning for semantic source code retrieval

Y Wan, J Shu, Y Sui, G Xu, Z Zhao… - 2019 34th IEEE/ACM …, 2019 - ieeexplore.ieee.org

Code retrieval techniques and tools have been playing a key role in facilitating software
developers to retrieve existing code fragments from available open-source repositories …

Cited by 200 Related articles All 14 versions

[PDF] thecvf.com

Bridge to answer: Structure-aware graph interaction network for video question answering

J Park, J Lee, K Sohn - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com

This paper presents a novel method, termed Bridge to Answer, to infer correct answers for
questions about a given video by leveraging adequate graph interactions of heterogeneous …

Cited by 114 Related articles All 7 versions

[PDF] thecvf.com

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

J Li, L Niu, L Zhang - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com

Video understanding has achieved great success in representation learning, such as video
caption, video object grounding, and video descriptive question-answer. However, current …

Cited by 64 Related articles All 5 versions

Triple attention learning for classification of 14 thoracic diseases using chest radiography

H Wang, S Wang, Z Qin, Y Zhang, R Li, Y Xia - Medical Image Analysis, 2021 - Elsevier

Chest X-ray is the most common radiology examination for the diagnosis of thoracic
diseases. However, due to the complexity of pathological abnormalities and lack of detailed …

Cited by 120 Related articles All 4 versions

[PDF] arxiv.org

ExCL: Extractive clip localization using natural language descriptions

S Ghosh, A Agarwal, Z Parekh… - arXiv preprint arXiv …, 2019 - arxiv.org

The task of retrieving clips within videos based on a given natural language query requires
cross-modal reasoning over multiple frames. Prior approaches such as sliding window …

Cited by 196 Related articles All 3 versions

[PDF] thecvf.com

Temporal query networks for fine-grained video understanding

C Zhang, A Gupta, A Zisserman - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

Our objective in this work is fine-grained classification of actions in untrimmed videos, where
the actions may be temporally extended or may span only a few frames of the video. We cast …

Cited by 103 Related articles All 12 versions

[PDF] thecvf.com

SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events

L Xu, H Huang, J Liu - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com

Traffic event cognition and reasoning in videos is an important task that has a wide range of
applications in intelligent transportation, assisted driving, and autonomous vehicles. In this …

Cited by 101 Related articles All 7 versions
