The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models

X Xie, P Zhou, H Li, Z Lin, S Yan - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
In deep learning, different kinds of deep networks typically need different optimizers, which
have to be chosen after multiple trials, making the training process inefficient. To relieve this …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

The (R)Evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

OLMo: Accelerating the science of language models

D Groeneveld, I Beltagy, P Walsh, A Bhagia… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) have become ubiquitous in both NLP research and in commercial
product offerings. As their commercial importance has surged, the most powerful models …

Are We on the Right Way for Evaluating Large Vision-Language Models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

JY Koh, R Lo, L Jang, V Duvvur, MC Lim… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents capable of planning, reasoning, and executing actions on the web offer
a promising avenue for automating computer tasks. However, the majority of existing …