Sigmoid loss for language image pre-training

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com

Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However the …

被引用次数：54 相关文章所有 4 个版本

[PDF] nowpublishers.com

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

被引用次数：98 相关文章所有 6 个版本

[PDF] neurips.cc

Scaling open-vocabulary object detection

M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc

Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …

被引用次数：61 相关文章所有 6 个版本

[PDF] arxiv.org

A survey on vision mamba: Models, applications and challenges

R Xu, S Yang, Y Wang, B Du, H Chen - arXiv preprint arXiv:2404.18861, 2024 - arxiv.org

Mamba, a recent selective structured state space model, performs excellently on long
sequence modeling tasks. Mamba mitigates the modeling constraints of convolutional …

被引用次数：11 相关文章所有 2 个版本

[PDF] neurips.cc

Image captioners are scalable vision learners too

M Tschannen, M Kumar, A Steiner… - Advances in …, 2024 - proceedings.neurips.cc

Contrastive pretraining on image-text pairs from the web is one of the most popular large-
scale pretraining strategies for vision backbones, especially in the context of large …

被引用次数：32 相关文章所有 5 个版本

[PDF] thecvf.com

Probing the 3d awareness of visual foundation models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

被引用次数：13 相关文章所有 3 个版本

[PDF] mlr.press

Enhancing activity prediction models in drug discovery with the ability to understand human language

P Seidl, A Vall, S Hochreiter… - … on Machine Learning, 2023 - proceedings.mlr.press

Activity and property prediction models are the central workhorses in drug discovery and
materials sciences, but currently, they have to be trained or fine-tuned for new tasks. Without …

被引用次数：38 相关文章所有 6 个版本

[PDF] arxiv.org

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org

In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

被引用次数：39 相关文章所有 2 个版本

[PDF] arxiv.org

Pali-3 vision language models: Smaller, faster, stronger

X Chen, X Wang, L Beyer, A Kolesnikov, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that
compares favorably to similar models that are 10x larger. As part of arriving at this strong …

被引用次数：36 相关文章所有 3 个版本

[PDF] arxiv.org

Remoteclip: A vision language foundation model for remote sensing

F Liu, D Chen, Z Guan, X Zhou, J Zhu… - … on Geoscience and …, 2024 - ieeexplore.ieee.org

General-purpose foundation models have led to recent breakthroughs in artificial
intelligence (AI). In remote sensing, self-supervised learning (SSL) and masked image …

被引用次数：47 相关文章所有 3 个版本

高级搜索

QQ 群