Transformer-based models have achieved some success in time series forecasting. Existing methods mainly model time series from limited or fixed scales, making it challenging to …
W Li, Z Ma, LJ Deng, X Fan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Most image-text retrieval methods produce accurate results by using fine-grained features for feature alignment. However, extracting robust features while …
G Zeng, Y Zhang, Y Zhou, X Yang, N Jiang, G Zhao… - Pattern Recognition, 2023 - Elsevier
Text-based visual question answering (TextVQA), which answers a visual question by considering both visual contents and scene texts, has attracted increasing attention recently …
All graphical user interfaces consist of one or more screens that may be shown to the user depending on their interactions. Identifying the different screens of an app and …
We generalize the notion of social biases from language embeddings to grounded vision and language embeddings. Biases are present in grounded embeddings, and indeed seem …
H Ahsan, N Bhalla, D Bhatt, K Shah - arXiv preprint arXiv:2105.08106, 2021 - arxiv.org
One of the ways blind people understand their surroundings is by taking pictures and relying on descriptions generated by image captioning systems. Current work on captioning …
H Sharma, AS Jalal - Multimedia tools and applications, 2022 - Springer
Both Visual Question Answering (VQA) and image captioning are problems that involve the Computer Vision (CV) and Natural Language Processing (NLP) domains. In …
C Fang, J Li, L Li, C Ma, D Hu - … of the 31st ACM International Conference …, 2023 - dl.acm.org
Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training …
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …