Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Foundational models defining a new era in vision: A survey and outlook

M Awais, M Naseer, S Khan, RM Anwer… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Segment everything everywhere all at once

X Zou, J Yang, H Zhang, F Li, L Li… - Advances in …, 2024 - proceedings.neurips.cc
In this work, we present SEEM, a promptable and interactive model for segmenting
everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
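The snippet above describes replacing the softmax-normalized contrastive objective with a pairwise sigmoid applied to every image-text pair. Below is a minimal PyTorch sketch of that idea, assuming L2-normalized embeddings and learnable log-temperature and bias scalars; the function name and signature are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, log_temp, bias):
    """Pairwise sigmoid loss over a batch of image-text embeddings.

    img_emb, txt_emb: [n, d] tensors, assumed L2-normalized.
    log_temp, bias: learnable scalars (log-temperature and bias) -- illustrative names.
    """
    # [n, n] similarity logits for all image-text pairs
    logits = img_emb @ txt_emb.t() * log_temp.exp() + bias
    n = logits.size(0)
    # +1 on the diagonal (matched pairs), -1 off the diagonal (mismatched pairs)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # each pair is an independent binary problem; no softmax normalization over the batch
    return -F.logsigmoid(labels * logits).sum() / n
```

Because no normalization couples the pairs, the loss decomposes per pair, which is what lets it operate without a global softmax over the batch.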

Generalized decoding for pixel, image, and language

X Zou, ZY Dou, J Yang, Z Gan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present X-Decoder, a generalized decoding model that can predict pixel-level
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …

GLIPv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a DNN for each single visual recognition task …

A simple framework for open-vocabulary segmentation and detection

H Zhang, F Li, X Zou, S Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that learns from different segmentation and detection datasets. To bridge the gap …

TEMOS: Generating Diverse Human Motions from Textual Descriptions

M Petrovich, MJ Black, G Varol - European Conference on Computer …, 2022 - Springer
We address the problem of generating diverse 3D human motions from textual descriptions.
This challenging task requires joint modeling of both modalities: understanding and …