Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

M Gerstgrasser, R Schaeffer, A Dey, R Rafailov… - arXiv preprint arXiv …, 2024 - arxiv.org
The proliferation of generative models, combined with pretraining on web-scale data, raises
a timely question: what happens when these models are trained on their own generated …
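
A toy sketch (our illustration, not the paper's experiments) of the two workflows the title contrasts: repeatedly refitting a 1-D Gaussian to its own samples ("replace") versus pooling real and synthetic samples across generations ("accumulate"). All constants here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 300                        # samples per generation, number of generations
real = rng.normal(0.0, 1.0, n)        # the original "real" dataset

def fit(data):
    return data.mean(), data.std()    # Gaussian MLE (std with ddof=0)

for workflow in ("replace", "accumulate"):
    pool = real.copy()
    mu, sigma = fit(pool)
    for _ in range(T):
        synthetic = rng.normal(mu, sigma, n)   # sample the current model
        pool = synthetic if workflow == "replace" else np.concatenate([pool, synthetic])
        mu, sigma = fit(pool)                  # refit on the (replaced or accumulated) pool
    print(f"{workflow:>10}: sigma after {T} generations = {sigma:.3f}")

Under "replace", the fitted spread decays toward zero across generations; accumulating real and synthetic data keeps it near 1.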

Safety vs. performance: How multi-objective learning reduces barriers to market entry

M Jagadeesan, MI Jordan, J Steinhardt - arXiv preprint arXiv:2409.03734, 2024 - arxiv.org
Emerging marketplaces for large language models and other large-scale machine learning
(ML) models appear to exhibit market concentration, which has raised concerns about …

Universality of the π²/6 Pathway in Avoiding Model Collapse

A Dey, D Donoho - arXiv preprint arXiv:2410.22812, 2024 - arxiv.org
Researchers in empirical machine learning recently spotlighted their fears of so-called
Model Collapse. They imagined a discard workflow, where an initial generative model is …
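
The degeneration of the discard workflow can be seen in a standard back-of-envelope calculation (a hedged example of ours, not a result quoted from the paper): if generation t refits a Gaussian by maximum likelihood to n samples drawn from generation t-1, the downward bias of the variance MLE compounds geometrically,

\mathbb{E}\!\left[\hat\sigma_t^2 \mid \hat\sigma_{t-1}^2\right] = \frac{n-1}{n}\,\hat\sigma_{t-1}^2
\quad\Longrightarrow\quad
\mathbb{E}\!\left[\hat\sigma_t^2\right] = \left(\frac{n-1}{n}\right)^{t}\sigma_0^2 \to 0,

so the fitted distribution's spread vanishes even though each single refit looks nearly lossless.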

Scaling Laws in Linear Regression: Compute, Parameters, and Data

L Lin, J Wu, SM Kakade, PL Bartlett, JD Lee - arXiv preprint arXiv …, 2024 - arxiv.org
Empirically, large-scale deep learning models often satisfy a neural scaling law: the test
error of the trained model improves polynomially as the model size and data size grow …
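
For reference, the polynomial improvement described here is conventionally written as a power law of the form (generic notation, not necessarily the paper's):

\mathrm{TestError}(M, N) \approx E_\infty + A\,M^{-\alpha} + B\,N^{-\beta}, \qquad \alpha, \beta > 0,

where M is the parameter count, N the number of training samples, and E_\infty the irreducible error.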

Universality in Transfer Learning for Linear Models

R Ghane, D Akhtiamov, B Hassibi - arXiv preprint arXiv:2410.02164, 2024 - arxiv.org
Transfer learning is an attractive framework for problems where there is a paucity of data, or
where data collection is costly. One common approach to transfer learning is referred to as "…
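
The snippet truncates before naming the approach. One common recipe for linear models, used here purely as a hypothetical stand-in (not necessarily the paper's setting), is to regularize the target-task weights toward a source-task solution, i.e. biased ridge regression:

import numpy as np

def transfer_ridge(X, y, w_src, lam):
    """Solve min_w ||Xw - y||^2 + lam * ||w - w_src||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_src)

rng = np.random.default_rng(1)
d = 20
w_true = rng.normal(size=d)
w_src = w_true + 0.1 * rng.normal(size=d)   # source model close to the target task
X = rng.normal(size=(10, d))                # only 10 target samples for d = 20
y = X @ w_true

for lam in (1e-8, 10.0):
    w_hat = transfer_ridge(X, y, w_src, lam)
    print(f"lam = {lam:g}: ||w_hat - w_true|| = {np.linalg.norm(w_hat - w_true):.3f}")

With scarce target data, pulling toward a nearby source solution (larger lam) typically reduces the estimation error.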

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

R Nakada, Y Xu, L Li, L Zhang - arXiv preprint arXiv:2406.03628, 2024 - arxiv.org
Imbalanced data and spurious correlations are common challenges in machine learning
and data science. Oversampling, which artificially increases the number of instances in the …
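
A minimal sketch of the generic oversampling idea, interpolating between random minority-class pairs in the spirit of SMOTE; the paper's own generator is an LLM, which this stand-in does not attempt to reproduce:

import numpy as np

def oversample_minority(X_min, n_new, rng):
    """Synthesize n_new points by convex interpolation of random minority pairs."""
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_min), n_new)
    t = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_min[i] + t * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(950, 5))     # majority class: 950 rows
X_min = rng.normal(2.0, 1.0, size=(50, 5))      # minority class: 50 rows
X_syn = oversample_minority(X_min, 900, rng)    # 900 synthetic minority rows
X_bal = np.vstack([X_maj, X_min, X_syn])        # rebalanced to 950 / 950
print(X_bal.shape)                              # (1900, 5)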

Finding Optimally Robust Data Mixtures via Concave Maximization

A Thudi, CJ Maddison - arXiv preprint arXiv:2406.01477, 2024 - arxiv.org
Training on mixtures of data distributions is now common in many modern machine learning
pipelines, useful for performing well on several downstream tasks. Group distributionally …
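
A toy brute-force illustration of the underlying problem (the mixture-selection objective, not the paper's concave-maximization algorithm): pick mixture weights over two groups so that the model fit on the weighted data minimizes the worst group's loss.

import numpy as np

rng = np.random.default_rng(0)
# Two groups whose true slopes differ; a single shared slope must compromise.
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y1, y2 = 1.0 * x1, 3.0 * x2

def fit_on_mixture(lam):
    """Closed-form weighted least squares for a shared slope on the lam-mixture."""
    num = lam * (x1 @ y1) / len(y1) + (1 - lam) * (x2 @ y2) / len(y2)
    den = lam * (x1 @ x1) / len(y1) + (1 - lam) * (x2 @ x2) / len(y2)
    return num / den

def worst_group_loss(w):
    return max(np.mean((x1 * w - y1) ** 2), np.mean((x2 * w - y2) ** 2))

best_loss, best_lam = min(
    (worst_group_loss(fit_on_mixture(lam)), lam) for lam in np.linspace(0, 1, 101)
)
print(f"worst-group loss {best_loss:.3f} at mixture weight lam = {best_lam:.2f}")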

The impact of data labeling quality in models for predicting users' subjective impressions

MA Bakaev, VA Khvorostov - International Journal of Open …, 2024 - cyberleninka.ru
Training data quality is widely recognized as the most important prerequisite for successfully
building machine learning (ML) models, but the specific components of this …

How web interface labeling quality affects ML models predicting users' subjective impressions

MA Bakaev, VA Khvorostov - International Journal of Open Information …, 2024 - injoit.ru
Training data quality is widely recognized as the main prerequisite for constructing
successful Machine Learning (ML) models. However, the concrete aspects of the data …
