Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day …
Abstract Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks …
We survey 146 papers analyzing" bias" in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that …
The recent release of ChatGPT has garnered widespread recognition for its exceptional ability to generate human-like responses in dialogue. Given its usage by users from various …
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens …
J Cho, A Zala, M Bansal - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recently, DALL-E, a multimodal transformer language model, and its variants including diffusion models have shown high-quality text-to-image generation capabilities. However …
Recently, contrastive learning has been shown to be effective in improving pre-trained language models (PLM) to derive high-quality sentence representations. It aims to pull close …
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world …