Asif: Coupled data turns unimodal models to multimodal without training

A Norelli, M Fumero, V Maiorca… - Advances in …, 2024 - proceedings.neurips.cc
CLIP proved that aligning visual and language spaces is key to solving many vision tasks
without explicit training, but required to train image and text encoders from scratch on a huge …

Laion-5b: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

Clippo: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Building Vision-Language Models on Solid Foundations with Masked Distillation

S Sameni, K Kafle, H Tan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Abstract Recent advancements in Vision-Language Models (VLMs) have marked a
significant leap in bridging the gap between computer vision and natural language …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …

Harmonized multimodal learning with Gaussian process latent variable models

G Song, S Wang, Q Huang… - IEEE transactions on …, 2019 - ieeexplore.ieee.org
Multimodal learning aims to discover the relationship between multiple modalities. It has
become an important research topic due to extensive multimodal applications such as cross …

MULTIZOO & MULTIBENCH: a standardized toolkit for multimodal deep learning

PP Liang, Y Lyu, X Fan, A Agarwal, Y Cheng… - The Journal of Machine …, 2023 - dl.acm.org
Learning multimodal representations involves integrating information from multiple
heterogeneous sources of data. In order to accelerate progress towards understudied …

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2023 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

What makes multi-modal learning better than single (provably)

Y Huang, C Du, Z Xue, X Chen… - Advances in Neural …, 2021 - proceedings.neurips.cc
The world provides us with data of multiple modalities. Intuitively, models fusing data from
different modalities outperform their uni-modal counterparts, since more information is …