Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Deep learning in microscopy image analysis: A survey

F Xing, Y Xie, H Su, F Liu, L Yang - IEEE Transactions on Neural …, 2017 - ieeexplore.ieee.org
Computerized microscopy image analysis plays an important role in computer-aided
diagnosis and prognosis. Machine learning techniques have powered many aspects of …

A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

NExT-QA: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

Deep modular co-attention networks for visual question answering

Z Yu, J Yu, Y Cui, D Tao, Q Tian - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …

GQA: A new dataset for real-world visual reasoning and compositional question answering

DA Hudson, CD Manning - … of the IEEE/CVF conference on …, 2019 - openaccess.thecvf.com
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …

OK-VQA: A visual question answering benchmark requiring external knowledge

K Marino, M Rastegari, A Farhadi… - Proceedings of the …, 2019 - openaccess.thecvf.com
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the
joint space of vision and language and serves as a proxy for the AI task of scene …

Towards VQA models that can read

A Singh, V Natarajan, M Shah… - Proceedings of the …, 2019 - openaccess.thecvf.com
Studies have shown that a dominant class of questions asked by visually impaired users on
images of their surroundings involves reading text in the image. But today's VQA models can …

The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision

J Mao, C Gan, P Kohli, JB Tenenbaum, J Wu - arXiv preprint arXiv …, 2019 - arxiv.org
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual
concepts, words, and semantic parsing of sentences without explicit supervision on any of …