C Chen, D Han, X Shen - Knowledge-Based Systems, 2023 - Elsevier
The emergence of the Transformer optimizes the interactive modeling of multimodal information in visual question answering (VQA) tasks, helping machines better understand …
Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans …
Language has been widely acknowledged as the benchmark of intelligence. However, evidence from cognitive science shows that intelligent behaviors in robust social …
Experience precedes understanding. Humans constantly explore and learn about their environment out of curiosity, gather information, and update their models of the world. On the …
LM Abouelmagd, MY Shams, HS Marie… - EURASIP Journal on …, 2024 - Springer
Plant diseases have a significant impact on leaves, with each disease exhibiting specific spots characterized by unique colors and locations. Therefore, it is crucial to develop a …
The aim of the image captioning task is to understand various semantic concepts such as objects and their relationships in an image and combine them to generate a natural …
L Zhu, L Peng, W Zhou, J Yang - Pattern Recognition Letters, 2023 - Elsevier
Visual Question Answering (VQA) has made stunning advances by exploiting Transformer architecture and large-scale visual-linguistic pretraining. State-of-the-art …
H Sharma, AS Jalal - Multimedia Tools and Applications, 2022 - Springer
The text present in natural scenes contains semantic information about its surrounding environment. For example, the majority of questions asked by blind people related to images …
Visual inspection is an important process for maintaining bridges in road transportation systems, and preventing catastrophic events and tragedies. In this process, accurate and …