CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Z Tang, Z Yang, M Khademi, Y Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract
We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples and to autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks such as editing, exemplar learning, composition, and reasoning. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing, and showcases a significant advancement in integrating diverse multimodal tasks with sequential generation.
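
The mechanism the abstract describes (non-text inputs aligned with the language embedding space, interleaved with text tokens, and consumed autoregressively) can be illustrated with a minimal Python sketch. This is not the authors' code: every class and function below, such as project_image and build_interleaved_sequence, is a hypothetical placeholder backed by random embeddings, included only to show how an interleaved multimodal prompt might be flattened into a single token-embedding sequence for an MLLM.

# Minimal conceptual sketch (hypothetical names, not the CoDi-2 implementation)
# of the interleaved any-to-any paradigm: non-text modalities are projected
# into the LLM's embedding space ("aligned with language"), interleaved with
# text tokens, and the resulting sequence is what an MLLM would consume.
from dataclasses import dataclass
from typing import List, Union
import numpy as np

D_MODEL = 64  # toy embedding width

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    pixels: np.ndarray  # stand-in for an image tensor

@dataclass
class AudioSegment:
    waveform: np.ndarray  # stand-in for an audio clip

Segment = Union[TextSegment, ImageSegment, AudioSegment]

def embed_text(text: str) -> np.ndarray:
    # placeholder tokenizer + embedding: one random vector per whitespace token
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), D_MODEL))

def project_image(pixels: np.ndarray) -> np.ndarray:
    # placeholder for an image encoder aligned with the language space
    rng = np.random.default_rng(0)
    return rng.standard_normal((4, D_MODEL))  # a few "image tokens"

def project_audio(waveform: np.ndarray) -> np.ndarray:
    # placeholder for an audio encoder aligned with the language space
    rng = np.random.default_rng(1)
    return rng.standard_normal((4, D_MODEL))

def build_interleaved_sequence(segments: List[Segment]) -> np.ndarray:
    """Flatten an interleaved multimodal prompt into one embedding sequence."""
    chunks = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            chunks.append(embed_text(seg.text))
        elif isinstance(seg, ImageSegment):
            chunks.append(project_image(seg.pixels))
        elif isinstance(seg, AudioSegment):
            chunks.append(project_audio(seg.waveform))
    return np.concatenate(chunks, axis=0)

if __name__ == "__main__":
    # An in-context, modality-interleaved instruction: show an example
    # input/output image pair, then ask for the same edit on a new image.
    prompt: List[Segment] = [
        TextSegment("Apply the transformation shown in this example:"),
        ImageSegment(np.zeros((32, 32, 3))),   # example input
        ImageSegment(np.ones((32, 32, 3))),    # example output
        TextSegment("to this new image:"),
        ImageSegment(np.full((32, 32, 3), 0.5)),
    ]
    seq = build_interleaved_sequence(prompt)
    print("interleaved sequence shape:", seq.shape)

In the system the abstract describes, these projections would come from encoders aligned with language, and the autoregressive outputs would be rendered back into images or audio by generation-side decoders; the sketch above only mimics the encoding-side data flow.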