A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

TextCaps: a dataset for image captioning with reading comprehension

O Sidorov, R Hu, M Rohrbach, A Singh - … 23–28, 2020, Proceedings, Part II …, 2020 - Springer
Image descriptions can help visually impaired people to quickly understand the image
content. While we have made significant progress in automatically describing images and optical …

FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

H Ding, S Wang, Z Xie, M Li, L Ma - Computers & Graphics, 2023 - Elsevier
Vision and language representation learning has been demonstrated to be an effective
means of enhancing multimodal task performance. However, fashion-specific studies have …

[HTML][HTML] A metamorphic testing approach for assessing question answering systems

K Tu, M Jiang, Z Ding - Mathematics, 2021 - mdpi.com
Question Answering (QA) enables the machine to understand and answer questions posed
in natural language, which has emerged as a powerful tool in various domains. However …

Improving Image Representations via MoCo Pre-training for Multimodal CXR Classification

F Dalla Serra, G Jacenków, F Deligianni… - Annual Conference on …, 2022 - Springer
Multimodal learning, here defined as learning from multiple input data types, has exciting
potential for healthcare. However, current techniques rely on large multimodal datasets …

[PDF][PDF] Amazon pars at memotion 2.0 2022: Multi-modal multi-task learning for memotion 2.0 challenge

GG Lee, M Shen - Proceedings http://ceur-ws.org ISSN, 2020 - ceur-ws.org
Over the years, memes have become very popular as social media services grow rapidly.
Understanding meme images as humans do is very complicated because of their multi-modal …

Visual Question Answering for Response Synthesis Based on Spatial Actions

G Kiselev, D Weizenfeld, Y Gorbunova - International Conference on …, 2022 - Springer
The paper considers the problem of automatically analyzing a user's natural language query about
an image. The mechanism synthesizes a logically correct non-binary response. Synthesis is …

Towards multilingual image captioning models that can read

R Gallardo García, B Beltrán Martínez… - … Conference on Artificial …, 2021 - Springer
Few current image captioning systems are capable of reading and integrating read text into the
generated descriptions, and none of them was developed to solve the problem from a bilingual …

N Yevtushenko, V Kuliamin… - Testing Software and …, 2019 - books.google.com
Homing, synchronizing and distinguishing sequences (HSs, SSs, and DSs) are used in FSM
(Finite State Machine) based testing for state identification and can significantly reduce the …