Sphinx-x: Scaling data and parameters for a family of multi-modal large language models

P Gao, R Zhang, C Liu, L Qiu, S Huang, W Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series
developed upon SPHINX. To improve the architecture and training efficiency, we modify the …

Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

Z Lin, C Liu, R Zhang, P Gao, L Qiu, H Xiao… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint
mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision …

Tinygpt-v: Efficient multimodal large language model via small backbones

Z Yuan, Z Li, L Sun - arXiv preprint arXiv:2312.16862, 2023 - arxiv.org
In the era of advanced multimodal learning, multimodal large language models (MLLMs)
such as GPT-4V have made remarkable strides towards bridging language and visual …

Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration

C Lyu, M Wu, L Wang, X Huang, B Liu, Z Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Although instruction-tuned large language models (LLMs) have exhibited remarkable
capabilities across various NLP tasks, their effectiveness on other data modalities beyond …

X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages

F Chen, M Han, H Zhao, Q Zhang, J Shi, S Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4,
based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous …

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Efficient multimodal learning from data-centric perspective

M He, Y Liu, B Wu, J Yuan, Y Wang, T Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in
general visual understanding and reasoning tasks. However, their deployment is hindered …

Next-gpt: Any-to-any multimodal llm

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

Q-bench: A benchmark for general-purpose foundation models on low-level vision

H Wu, Z Zhang, E Zhang, C Chen, L Liao… - arXiv preprint arXiv …, 2023 - arxiv.org
The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift
in computer vision from specialized models to general-purpose foundation models …