- Academic resource search

LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day

C Li, C Wong, S Zhang, N Usuyama… - Advances in …, 2024 - proceedings.neurips.cc

Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigations focus on unimodal text. Multimodal …

Cited by: 343 · Related articles · All 6 versions

GLIGEN: Open-set grounded text-to-image generation

Y Li, H Liu, Q Wu, F Mu, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we …

Cited by: 189 · Related articles · All 5 versions

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Cited by: 142 · Related articles · All 6 versions

An inverse scaling law for CLIP training

X Li, Z Wang, C Xie - Advances in Neural Information …, 2024 - proceedings.neurips.cc

CLIP, one of the pioneering foundation models that connect images and text, has enabled
many recent breakthroughs in computer vision. However, its associated training cost is …

Cited by: 28 · Related articles · All 5 versions

The Neglected Tails in Vision-Language Models

S Parashar, Z Lin, T Liu, X Dong, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com

Vision-language models (VLMs) excel in zero-shot recognition but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …

Cited by: 14 · Related articles · All 3 versions

Scaling vision-language models with sparse mixture of experts

S Shen, Z Yao, C Li, T Darrell, K Keutzer… - arXiv preprint arXiv …, 2023 - arxiv.org

The field of natural language processing (NLP) has made significant strides in recent years,
particularly in the development of large-scale vision-language models (VLMs). These …

Cited by: 39 · Related articles · All 4 versions

Text-guided foundation model adaptation for pathological image classification

Y Zhang, J Gao, M Zhou, X Wang, Y Qiao… - … Conference on Medical …, 2023 - Springer

The recent surge of foundation models in computer vision and natural language processing
opens up perspectives in utilizing multi-modal clinical data to train large models with strong …

Cited by: 29 · Related articles · All 4 versions

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

Cited by: 4 · Related articles · All 5 versions

Neural priming for sample-efficient adaptation

M Wallingford, V Ramanujan, A Fang… - Advances in …, 2024 - proceedings.neurips.cc

We propose Neural Priming, a technique for adapting large pretrained models to
distribution shifts and downstream tasks given few or no labeled examples. Presented with …

Cited by: 9 · Related articles · All 6 versions

Retrieval-enhanced contrastive vision-text models

A Iscen, M Caron, A Fathi, C Schmid - arXiv preprint arXiv:2306.07196, 2023 - arxiv.org

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art
systems. While they excel at recognizing common generic concepts, they still struggle on …

Cited by: 16 · Related articles · All 3 versions
