Localize, group, and select: Boosting Text-VQA by scene text modeling

X Lu, Z Fan, Y Wang, J Oh… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
As an important task in multimodal context understanding, Text-VQA aims to answer questions by reading textual information in images. It differs from the original VQA …

Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting

P Chen, Y Zhang, Y Cheng, Y Shu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based models have achieved some success in time series forecasting. Existing
methods mainly model time series from limited or fixed scales, making it challenging to …

Neuron-based spiking transmission and reasoning network for robust image-text retrieval

W Li, Z Ma, LJ Deng, X Fan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Most image-text retrieval methods achieve accurate results by using fine-grained features for feature alignment. However, extracting robust features while …

Beyond OCR + VQA: Towards end-to-end reading and reasoning for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang, N Jiang, G Zhao… - Pattern Recognition, 2023 - Elsevier
Text-based visual question answering (TextVQA), which answers a visual question by considering both visual content and scene text, has attracted increasing attention recently …

Understanding screen relationships from screenshots of smartphone applications

S Feiz, J Wu, X Zhang, A Swearngin, T Barik… - Proceedings of the 27th …, 2022 - dl.acm.org
All graphical user interfaces are composed of one or more screens that may be shown to the user depending on their interactions. Identifying different screens of an app and …

Measuring social biases in grounded vision and language embeddings

C Ross, B Katz, A Barbu - arXiv preprint arXiv:2002.08911, 2020 - arxiv.org
We generalize the notion of social biases from language embeddings to grounded vision
and language embeddings. Biases are present in grounded embeddings, and indeed seem …

Multi-modal image captioning for the visually impaired

H Ahsan, N Bhalla, D Bhatt, K Shah - arXiv preprint arXiv:2105.08106, 2021 - arxiv.org
One of the ways blind people understand their surroundings is by capturing images and relying on descriptions generated by image captioning systems. Current work on captioning …

Image captioning improved visual question answering

H Sharma, AS Jalal - Multimedia Tools and Applications, 2022 - Springer
Both Visual Question Answering (VQA) and image captioning are problems that involve the Computer Vision (CV) and Natural Language Processing (NLP) domains. In …

Separate and locate: Rethink the text in text-based visual question answering

C Fang, J Li, L Li, C Ma, D Hu - … of the 31st ACM International Conference …, 2023 - dl.acm.org
Text-based Visual Question Answering (TextVQA) aims at answering questions about the
text in images. Most works in this field focus on designing network structures or pre-training …

PreSTU: Pre-training for scene-text understanding

J Kil, S Changpinyo, X Chen, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …