A survey of multimodal large language models from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

Data curation via joint example selection further accelerates multimodal learning

T Evans, N Parthasarathy, H Merzic… - arXiv preprint arXiv …, 2024 - arxiv.org
Data curation is an essential component of large-scale pretraining. In this work, we
demonstrate that jointly selecting batches of data is more effective for learning than selecting …
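
At its core (a minimal sketch, not the paper's exact algorithm): each example is scored by "learnability", the gap between the learner's loss and a frozen pretrained reference model's loss, and the highest-scoring portion of a large candidate batch is kept. The function names and the simple top-k selection below are illustrative assumptions; the paper's contribution is selecting sub-batches jointly rather than example-by-example, which this independent top-k does not capture.

import torch

def learnability_scores(learner_loss: torch.Tensor,
                        reference_loss: torch.Tensor) -> torch.Tensor:
    # High when the learner still finds an example hard but the
    # pretrained reference model already finds it easy.
    return learner_loss - reference_loss

def select_batch(learner_loss, reference_loss, keep_fraction=0.1):
    # Hypothetical simplification: keep the top-scoring fraction of a
    # large candidate batch via independent top-k selection.
    scores = learnability_scores(learner_loss, reference_loss)
    k = max(1, int(keep_fraction * scores.numel()))
    return torch.topk(scores, k).indices

# Usage with dummy per-example losses for a 1024-example candidate batch:
keep = select_batch(torch.rand(1024), torch.rand(1024))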

Data management for large language models: A survey

Z Wang, W Zhong, Y Wang, Q Zhu, F Mi, B Wang… - CoRR, 2023 - openreview.net
Data plays a fundamental role in the training of Large Language Models (LLMs). Effective
data management, particularly in the formulation of a well-suited training dataset, holds …

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
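
For context (a background fact, not part of the snippet): both works fit power laws of the form N*(C) ∝ C^a for the compute-optimal model size N* at compute budget C, but with very different exponents, roughly a ≈ 0.73 in Kaplan et al. versus a ≈ 0.50 in Hoffmann et al.; this is the discrepancy the paper sets out to resolve. In LaTeX:

% Compute-optimal model size as a power law in the compute budget C.
% Exponents are approximate values reported in the respective papers.
N^{*}(C) \propto C^{a}, \qquad
a_{\text{Kaplan}} \approx 0.73, \qquad
a_{\text{Hoffmann}} \approx 0.50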

Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

R Hong, W Agnew, T Kohno… - Proceedings of the 4th …, 2024 - dl.acm.org
As training datasets become increasingly drawn from unstructured, uncontrolled
environments such as the web, researchers and industry practitioners have increasingly …
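
The filtering being audited is, at heart, a cosine-similarity threshold on paired CLIP image and text embeddings. A minimal sketch, assuming the embeddings are already computed; the 0.28 threshold and array shapes are illustrative assumptions (0.28 is a commonly cited LAION-style cutoff, not necessarily DataComp's):

import numpy as np

def clip_filter(image_embs: np.ndarray,
                text_embs: np.ndarray,
                threshold: float = 0.28) -> np.ndarray:
    # Normalize rows so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)  # per-pair cosine similarity
    return sims >= threshold        # boolean keep-mask

# Usage with random stand-ins for 512-dim embeddings of 1000 pairs:
rng = np.random.default_rng(0)
keep = clip_filter(rng.normal(size=(1000, 512)),
                   rng.normal(size=(1000, 512)))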

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed the rapid development of large language models (LLMs). Building
on these powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

Optimization hyper-parameter laws for large language models

X Xie, S Yan, KC Toh, T Wei - arXiv preprint arXiv:2409.04777, 2024 - arxiv.org
Large Language Models have driven significant AI advancements, yet their training is
resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws …

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

A Hägele, E Bakouch, A Kosson, LB Allal… - arXiv preprint arXiv …, 2024 - arxiv.org
Scale has become a main ingredient in obtaining strong machine learning models. As a
result, understanding a model's scaling properties is key to effectively designing both the …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …
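
As background for the snippet's framing: the standard KD baseline trains the small model to match the teacher's softened output distribution (Hinton et al., 2015). A minimal PyTorch sketch of that baseline loss; the temperature and mixing weight are conventional choices, not values from this paper:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)  # supervised term
    return alpha * soft + (1 - alpha) * hard

# Usage with dummy logits for a 10-class problem:
s, t = torch.randn(8, 10, requires_grad=True), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)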