Grounding Language Models to Images for Multimodal Inputs and Outputs

JY Koh, R Salakhutdinov, D Fried - Proceedings of the 40th International Conference on Machine Learning, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …

Grounding Language Models to Images for Multimodal Inputs and Outputs

JY Koh, R Salakhutdinov, D Fried - arXiv preprint arXiv:2301.13823, 2023 - arxiv.org
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …
