Generation and comprehension of unambiguous object descriptions

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 148 · Related articles · All 7 versions

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org

Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Cited by 2283 · Related articles · All 8 versions

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Cited by 320 · Related articles · All 3 versions

Image segmentation using text and image prompts

T Lüddecke, A Ecker - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com

Image segmentation is usually addressed by training a model for a fixed set of object
classes. Incorporating additional classes or more complex queries later is expensive as it …

Cited by 319 · Related articles · All 7 versions

CRIS: CLIP-driven referring image segmentation

Z Wang, Y Lu, Q Li, X Tao, Y Guo… - Proceedings of the …, 2022 - openaccess.thecvf.com

Referring image segmentation aims to segment a referent via a natural linguistic expression.
Due to the distinct data properties between text and image, it is challenging for a network to …

Cited by 272 · Related articles · All 7 versions

LAVT: Language-aware vision transformer for referring image segmentation

Z Yang, J Wang, Y Tang, K Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com

Referring image segmentation is a fundamental vision-language task that aims to segment
out an object referred to by a natural language expression from an image. One of the key …

Cited by 226 · Related articles · All 10 versions

Making the most of text semantics to improve biomedical vision–language processing

B Boecking, N Usuyama, S Bannur, DC Castro… - European conference on …, 2022 - Springer

Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …

Cited by 157 · Related articles · All 9 versions

ReSTR: Convolution-free referring image segmentation using transformers

N Kim, D Kim, C Lan, W Zeng… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Referring image segmentation is an advanced semantic segmentation task where target is
not a predefined class but is described in natural language. Most of existing methods for this …

Cited by 123 · Related articles · All 7 versions

Foundations and trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 127 · Related articles · All 2 versions

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …

Cited by 95 · Related articles · All 7 versions
