4M: Massively multimodal masked modeling

FB Baldassini, M Shukor, M Cord… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large Language Models have demonstrated remarkable performance across
various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context …

Cited by 4 · Related articles · All 2 versions

[PDF] arxiv.org

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

Cited by 4 · Related articles · All 2 versions

[PDF] arxiv.org

MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning

V Nedungadi, A Kariryaa, S Oehmcke… - arXiv preprint arXiv …, 2024 - arxiv.org

The volume of unlabelled Earth observation (EO) data is huge, but many important
applications lack labelled training data. However, EO data offers the unique opportunity to …

Cited by 1 · Related articles · All 2 versions

[PDF] thecvf.com

BiMAE - A Bimodal Masked Autoencoder Architecture for Single-Label Hyperspectral Image Classification

M Kukushkin, M Bogdan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Hyperspectral imaging offers manifold opportunities for applications that may not or only
partially be achieved within the visual spectrum. Our paper presents a novel approach for …

[PDF] arxiv.org

Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases

R Aguina-Kang, M Gumin, DH Han, S Morris… - arXiv preprint arXiv …, 2024 - arxiv.org

We present a system for generating indoor scenes in response to text prompts. The prompts
are not limited to a fixed vocabulary of scene descriptions, and the objects in generated …

Cited by 3 · Related articles · All 2 versions

[PDF] arxiv.org

An Image is Worth 32 Tokens for Reconstruction and Generation

Q Yu, M Weber, X Deng, X Shen, D Cremers… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements in generative models have highlighted the crucial role of image
tokenization in the efficient synthesis of high-resolution images. Tokenization, which …

Cited by 4 · Related articles · All 2 versions

SC-VAE: Sparse Coding-based Variational Autoencoder with Learned ISTA

P Xiao, P Qiu, SM Ha, A Bani, S Zhou… - Available at SSRN …, 2023 - papers.ssrn.com

Learning rich data representations from unlabeled data is a key challenge towards applying
deep learning algorithms in downstream tasks. Several variants of variational autoencoders …

Cited by 3 · Related articles · All 3 versions

[PDF] arxiv.org

The Evolution of Multimodal Model Architectures

SN Wadekar, A Chaurasia, A Chadha… - arXiv preprint arXiv …, 2024 - arxiv.org

This work uniquely identifies and characterizes four prevalent multimodal model
architectural patterns in the contemporary multimodal landscape. Systematically …

Cited by 3 · Related articles · All 4 versions

[PDF] arxiv.org

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

S Yang, Z Zhong, M Zhao, S Takahashi, M Ishii… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, with the realistic generation results and a wide range of personalized
applications, diffusion-based generative models gain huge attention in both visual and …

Cited by 1 · Related articles · All 2 versions

[PDF] arxiv.org

Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry

M Alberts, O Schilter, F Zipoli, N Hartrampf… - arXiv preprint arXiv …, 2024 - arxiv.org

Spectroscopic techniques are essential tools for determining the structure of molecules.
Different spectroscopic techniques, such as Nuclear magnetic resonance (NMR), Infrared …
