Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc
Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio …
S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient …
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding …
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality …
Video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is difficult to extend to …
With the rapid growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given …
P Li, CW Xie, L Zhao, H Xie, J Ge… - Proceedings of the …, 2023 - openaccess.thecvf.com
The performance of text-video retrieval has been significantly improved by vision-language cross-modal learning schemes. The typical solution is to directly align the global video-level …
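The "typical solution" this snippet describes, directly aligning a single global video-level embedding with a text embedding, can be illustrated with a minimal contrastive-alignment sketch. Everything below is an assumption for illustration only: the mean-pooling of frame features, the temperature value, and the symmetric InfoNCE loss are generic choices common in this line of work, not the cited paper's specific method.

```python
import torch
import torch.nn.functional as F

def global_video_text_alignment(frame_feats, text_feats, temperature=0.07):
    """Contrastive alignment of pooled (global) video features with text features.

    frame_feats: (batch, num_frames, dim) per-frame embeddings from some video encoder
    text_feats:  (batch, dim) sentence embeddings from some text encoder
    Both encoders are assumed here, not specified by the cited snippet.
    """
    # Mean-pool frames into one global video embedding per clip (a generic choice).
    video_feats = frame_feats.mean(dim=1)

    # L2-normalize so the dot product below is a cosine similarity.
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares video i with text j.
    logits = video_feats @ text_feats.t() / temperature

    # Symmetric InfoNCE: matched video-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

Mean pooling is the coarsest possible global aggregation; the snippet suggests the cited work improves on precisely this kind of video-level alignment rather than adopting it as-is.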
Abstract: We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for …
T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text …