LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day

C Li, C Wong, S Zhang, N Usuyama… - Advances in …, 2024 - proceedings.neurips.cc
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigations focus on unimodal text. Multimodal …

GLIGEN: Open-set grounded text-to-image generation

Y Li, H Liu, Q Wu, F Mu, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale text-to-image diffusion models have made amazing advances. However, the
status quo is to use text input alone, which can impede controllability. In this work, we …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation
models that demonstrate vision and vision-language capabilities, focusing on the transition from …

An inverse scaling law for CLIP training

X Li, Z Wang, C Xie - Advances in Neural Information …, 2024 - proceedings.neurips.cc
CLIP, one of the pioneering foundation models that connect images and text, has enabled
many recent breakthroughs in computer vision. However, its associated training cost is …

The Neglected Tails in Vision-Language Models

S Parashar, Z Lin, T Liu, X Dong, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) excel in zero-shot recognition but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …

Scaling vision-language models with sparse mixture of experts

S Shen, Z Yao, C Li, T Darrell, K Keutzer… - arXiv preprint arXiv …, 2023 - arxiv.org
The field of natural language processing (NLP) has made significant strides in recent years,
particularly in the development of large-scale vision-language models (VLMs). These …

Text-guided foundation model adaptation for pathological image classification

Y Zhang, J Gao, M Zhou, X Wang, Y Qiao… - … Conference on Medical …, 2023 - Springer
The recent surge of foundation models in computer vision and natural language processing
opens up perspectives in utilizing multi-modal clinical data to train large models with strong …

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

Neural priming for sample-efficient adaptation

M Wallingford, V Ramanujan, A Fang… - Advances in …, 2024 - proceedings.neurips.cc
We propose Neural Priming, a technique for adapting large pretrained models to
distribution shifts and downstream tasks given few or no labeled examples. Presented with …

Retrieval-enhanced contrastive vision-text models

A Iscen, M Caron, A Fathi, C Schmid - arXiv preprint arXiv:2306.07196, 2023 - arxiv.org
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art
systems. While they excel at recognizing common generic concepts, they still struggle on …