Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Image segmentation using text and image prompts

T Lüddecke, A Ecker - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Image segmentation is usually addressed by training a model for a fixed set of object
classes. Incorporating additional classes or more complex queries later is expensive as it …

CRIS: CLIP-driven referring image segmentation

Z Wang, Y Lu, Q Li, X Tao, Y Guo… - Proceedings of the …, 2022 - openaccess.thecvf.com
Referring image segmentation aims to segment a referent via a natural linguistic expression.
Due to the distinct data properties between text and image, it is challenging for a network to …

LAVT: Language-aware vision transformer for referring image segmentation

Z Yang, J Wang, Y Tang, K Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Referring image segmentation is a fundamental vision-language task that aims to segment
out an object referred to by a natural language expression from an image. One of the key …

Making the most of text semantics to improve biomedical vision–language processing

B Boecking, N Usuyama, S Bannur, DC Castro… - European conference on …, 2022 - Springer
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …

ReSTR: Convolution-free referring image segmentation using transformers

N Kim, D Kim, C Lan, W Zeng… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Referring image segmentation is an advanced semantic segmentation task where the target is
not a predefined class but is described in natural language. Most existing methods for this …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …