相关文章- 学术资源搜索

Asif: Coupled data turns unimodal models to multimodal without training

A Norelli, M Fumero, V Maiorca… - Advances in …, 2024 - proceedings.neurips.cc

CLIP proved that aligning visual and language spaces is key to solving many vision tasks
without explicit training, but required to train image and text encoders from scratch on a huge …

被引用次数：17 相关文章所有 6 个版本

[PDF] neurips.cc

Laion-5b: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022 - proceedings.neurips.cc

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

被引用次数：1885 相关文章所有 12 个版本

[PDF] thecvf.com

Clippo: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

被引用次数：21 相关文章所有 6 个版本

[PDF] thecvf.com

Building Vision-Language Models on Solid Foundations with Masked Distillation

S Sameni, K Kafle, H Tan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Abstract Recent advancements in Vision-Language Models (VLMs) have marked a
significant leap in bridging the gap between computer vision and natural language …

[PDF] arxiv.org

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

被引用次数：119 相关文章所有 2 个版本

[PDF] mlr.press

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press

We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …

被引用次数：124 相关文章所有 7 个版本

[PDF] arxiv.org

Harmonized multimodal learning with Gaussian process latent variable models

G Song, S Wang, Q Huang… - IEEE transactions on …, 2019 - ieeexplore.ieee.org

Multimodal learning aims to discover the relationship between multiple modalities. It has
become an important research topic due to extensive multimodal applications such as cross …

被引用次数：24 相关文章所有 8 个版本

[PDF] jmlr.org

MULTIZOO & MULTIBENCH: a standardized toolkit for multimodal deep learning

PP Liang, Y Lyu, X Fan, A Agarwal, Y Cheng… - The Journal of Machine …, 2023 - dl.acm.org

Learning multimodal representations involves integrating information from multiple
heterogeneous sources of data. In order to accelerate progress towards understudied …

被引用次数：2 相关文章所有 3 个版本

[PDF] acm.org

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2023 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

被引用次数：9 相关文章

[PDF] neurips.cc

What makes multi-modal learning better than single (provably)

Y Huang, C Du, Z Xue, X Chen… - Advances in Neural …, 2021 - proceedings.neurips.cc

The world provides us with data of multiple modalities. Intuitively, models fusing data from
different modalities outperform their uni-modal counterparts, since more information is …

被引用次数：206 相关文章所有 8 个版本

高级搜索

QQ 群