Pix2Struct: Screenshot parsing as pretraining for visual language understanding

K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced the performance for many vision-
language tasks. However, most existing pre-trained models only excel in either …

LION: Empowering multimodal large language model with dual-level visual knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …

HRVDA: High-resolution visual document assistant

C Liu, K Yin, H Cao, X Jiang, X Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Leveraging vast training data, multimodal large language models (MLLMs) have
demonstrated formidable general visual comprehension capabilities and achieved …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) have rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Bootstrapping vision-language learning with decoupled language pre-training

Y Jian, C Gao, S Vosoughi - Advances in Neural …, 2024 - proceedings.neurips.cc
We present a novel methodology aimed at optimizing the application of frozen large
language models (LLMs) for resource-intensive vision-language (VL) pre-training. The …

ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

J Lu, D Batra, D Parikh, S Lee - Advances in neural …, 2019 - proceedings.neurips.cc
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-
agnostic joint representations of image content and natural language. We extend the …

Unifying vision-and-language tasks via text generation

J Cho, J Lei, H Tan, M Bansal - International Conference on …, 2021 - proceedings.mlr.press
Existing methods for vision-and-language learning typically require designing task-specific
architectures and objectives for each task. For example, a multi-label answer classifier for …

Unified language-vision pretraining with dynamic discrete visual tokenization

Y Jin, K Xu, L Chen, C Liao, J Tan, B Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, the remarkable advance of large language models (LLMs) has inspired
researchers to transfer their extraordinary reasoning capability to data across several …