Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Easily accessible text-to-image generation amplifies demographic stereotypes at large scale

F Bianchi, P Kalluri, E Durmus, F Ladhak… - Proceedings of the …, 2023 - dl.acm.org
Machine learning models that convert user-written text descriptions into images are now
widely available online and used by millions of users to generate millions of images a day …

CPT: Colorful prompt tuning for pre-trained vision-language models

Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …

Language (technology) is power: A critical survey of "bias" in NLP

SL Blodgett, S Barocas, H Daumé III… - arXiv preprint arXiv …, 2020 - arxiv.org
We survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are
often vague, inconsistent, and lacking in normative reasoning, despite the fact that …

Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study

Y Cao, L Zhou, S Lee, L Cabello, M Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent release of ChatGPT has garnered widespread recognition for its exceptional
ability to generate human-like responses in dialogue. Given its usage by users from various …

CM3: A causal masked multimodal model of the internet

A Aghajanyan, B Huang, C Ross, V Karpukhin… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce CM3, a family of causally masked generative models trained over a large
corpus of structured multi-modal documents that can contain both text and image tokens …

DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models

J Cho, A Zala, M Bansal - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recently, DALL-E, a multimodal transformer language model, and its variants including
diffusion models have shown high-quality text-to-image generation capabilities. However …

Debiased contrastive learning of unsupervised sentence representations

K Zhou, B Zhang, WX Zhao, JR Wen - arXiv preprint arXiv:2205.00656, 2022 - arxiv.org
Recently, contrastive learning has been shown to be effective in improving pre-trained
language models (PLM) to derive high-quality sentence representations. It aims to pull close …

MultiBench: Multiscale benchmarks for multimodal representation learning

PP Liang, Y Lyu, X Fan, Z Wu, Y Cheng… - Advances in neural …, 2021 - ncbi.nlm.nih.gov
Learning multimodal representations involves integrating information from multiple
heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world …