Localize, group, and select: Boosting Text-VQA by scene text modeling

X Lu, Z Fan, Y Wang, J Oh… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
As an important task in multimodal context understanding, Text-VQA aims to answer questions by reading textual information in images. It differs from the original VQA …

Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting

P Chen, Y Zhang, Y Cheng, Y Shu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based models have achieved some success in time series forecasting. Existing
methods mainly model time series from limited or fixed scales, making it challenging to …

Neuron-based spiking transmission and reasoning network for robust image-text retrieval

W Li, Z Ma, LJ Deng, X Fan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Most image-text retrieval methods achieve accurate results by using fine-grained features for feature alignment. However, extracting robust features while …

Beyond OCR + VQA: Towards end-to-end reading and reasoning for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang, N Jiang, G Zhao… - Pattern Recognition, 2023 - Elsevier
Text-based visual question answering (TextVQA), which answers a visual question by considering both visual content and scene text, has attracted increasing attention recently …

Understanding screen relationships from screenshots of smartphone applications

S Feiz, J Wu, X Zhang, A Swearngin, T Barik… - Proceedings of the 27th …, 2022 - dl.acm.org
All graphical user interfaces are composed of one or more screens that may be shown to the user depending on their interactions. Identifying different screens of an app and …

Measuring social biases in grounded vision and language embeddings

C Ross, B Katz, A Barbu - arXiv preprint arXiv:2002.08911, 2020 - arxiv.org
We generalize the notion of social biases from language embeddings to grounded vision
and language embeddings. Biases are present in grounded embeddings, and indeed seem …

Multi-modal image captioning for the visually impaired

H Ahsan, N Bhalla, D Bhatt, K Shah - arXiv preprint arXiv:2105.08106, 2021 - arxiv.org
One of the ways blind people understand their surroundings is by capturing images and relying on descriptions generated by image captioning systems. Current work on captioning …

Image captioning improved visual question answering

H Sharma, AS Jalal - Multimedia Tools and Applications, 2022 - Springer
Both Visual Question Answering (VQA) and image captioning are problems that involve the Computer Vision (CV) and Natural Language Processing (NLP) domains. In …

Separate and locate: Rethink the text in text-based visual question answering

C Fang, J Li, L Li, C Ma, D Hu - … of the 31st ACM International Conference …, 2023 - dl.acm.org
Text-based Visual Question Answering (TextVQA) aims at answering questions about the
text in images. Most works in this field focus on designing network structures or pre-training …

PreSTU: Pre-training for scene-text understanding

J Kil, S Changpinyo, X Chen, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …