What Makes Multimodal In-Context Learning Work?

FB Baldassini, M Shukor, M Cord… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models have demonstrated remarkable performance across
various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context …

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning

V Nedungadi, A Kariryaa, S Oehmcke… - arXiv preprint arXiv …, 2024 - arxiv.org
The volume of unlabelled Earth observation (EO) data is huge, but many important
applications lack labelled training data. However, EO data offers the unique opportunity to …

BiMAE - A Bimodal Masked Autoencoder Architecture for Single-Label Hyperspectral Image Classification

M Kukushkin, M Bogdan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Hyperspectral imaging offers manifold opportunities for applications that may not or only
partially be achieved within the visual spectrum. Our paper presents a novel approach for …

Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases

R Aguina-Kang, M Gumin, DH Han, S Morris… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a system for generating indoor scenes in response to text prompts. The prompts
are not limited to a fixed vocabulary of scene descriptions, and the objects in generated …

An Image is Worth 32 Tokens for Reconstruction and Generation

Q Yu, M Weber, X Deng, X Shen, D Cremers… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in generative models have highlighted the crucial role of image
tokenization in the efficient synthesis of high-resolution images. Tokenization, which …

SC-VAE: Sparse Coding-based Variational Autoencoder with Learned ISTA

P Xiao, P Qiu, SM Ha, A Bani, S Zhou… - Available at SSRN …, 2023 - papers.ssrn.com
Learning rich data representations from unlabeled data is a key challenge towards applying
deep learning algorithms in downstream tasks. Several variants of variational autoencoders …

The Evolution of Multimodal Model Architectures

SN Wadekar, A Chaurasia, A Chadha… - arXiv preprint arXiv …, 2024 - arxiv.org
This work uniquely identifies and characterizes four prevalent multimodal model
architectural patterns in the contemporary multimodal landscape. Systematically …

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

S Yang, Z Zhong, M Zhao, S Takahashi, M Ishii… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, with the realistic generation results and a wide range of personalized
applications, diffusion-based generative models gain huge attention in both visual and …

Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry

M Alberts, O Schilter, F Zipoli, N Hartrampf… - arXiv preprint arXiv …, 2024 - arxiv.org
Spectroscopic techniques are essential tools for determining the structure of molecules.
Different spectroscopic techniques, such as Nuclear magnetic resonance (NMR), Infrared …