- Academic resource search

The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com

Abstract Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Cited by 192 · Related articles · All 8 versions

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 171 · Related articles · All 7 versions

[PDF] arxiv.org

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org

Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Cited by 2672 · Related articles · All 4 versions

[PDF] neurips.cc

VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks

W Wang, Z Chen, X Chen, J Wu… - Advances in …, 2024 - proceedings.neurips.cc

Large language models (LLMs) have notably accelerated progress towards artificial general
intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing …

Cited by 349 · Related articles · All 6 versions

[PDF] arxiv.org

MIMIC-IT: Multi-modal in-context instruction tuning

B Li, Y Zhang, L Chen, J Wang, F Pu, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org

High-quality instructions and responses are essential for the zero-shot performance of large
language models on interactive natural language tasks. For interactive vision-language …

Cited by 543 · Related articles · All 4 versions

A survey of convolutional neural network research (卷积神经网络研究综述)

Li Yandong (李彦冬), Hao Zongbo (郝宗波), Lei Hang (雷航) - Journal of Computer Applications (计算机应用), 2016 - joca.cn

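In recent years, convolutional neural networks have achieved a series of breakthrough results in image classification, object detection, image semantic segmentation, and other fields.
Their powerful feature learning and classification capabilities have attracted wide attention and are of significant analytical and research value …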

Cited by 139 · Related articles · All 4 versions

[PDF] arxiv.org

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org

Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Cited by 737 · Related articles · All 4 versions

[PDF] arxiv.org

Visual ChatGPT: Talking, drawing and editing with visual foundation models

C Wu, S Yin, W Qi, X Wang, Z Tang, N Duan - arXiv preprint arXiv …, 2023 - arxiv.org

ChatGPT is attracting a cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …

Cited by 584 · Related articles · All 3 versions

[PDF] thecvf.com

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Cited by 307 · Related articles · All 3 versions

[PDF] arxiv.org

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org

Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …

Cited by 826 · Related articles · All 6 versions
