Do large language models have a legal duty to tell the truth?

S Wachter, B Mittelstadt… - Royal Society Open …, 2024 - royalsocietypublishing.org
Careless speech is a new type of harm created by large language models (LLM) that poses
cumulative, long-term risks to science, education and shared social truth in democratic …

No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - The Thirty-eighth …, 2024 - openreview.net
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

A tale of tails: Model collapse as a change of scaling laws

E Dohmatob, Y Feng, P Yang, F Charton… - arXiv preprint arXiv …, 2024 - arxiv.org
As AI model size grows, neural scaling laws have become a crucial tool to predict the
improvements of large models when increasing capacity and the size of original (human or …

Beyond model collapse: Scaling up with synthesized data requires reinforcement

Y Feng, E Dohmatob, P Yang, F Charton… - ICML 2024 Workshop …, 2024 - openreview.net
Synthesized data from generative models is increasingly considered as an alternative to
human-annotated data for fine-tuning Large Language Models. This raises concerns about …

DataDream: Few-shot guided dataset generation

JM Kim, J Bader, S Alaniz, C Schmid… - European Conference on …, 2025 - Springer
While text-to-image diffusion models have been shown to achieve state-of-the-art results in
image synthesis, they have yet to prove their effectiveness in downstream applications …

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

S Zheng, Z Bao, R Zhao, M Hebert… - arXiv preprint arXiv …, 2024 - arxiv.org
Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising
results in dense visual perception tasks. However, most existing work treats diffusion models …

Beyond model collapse: Scaling up with synthesized data requires verification

Y Feng, E Dohmatob, P Yang, F Charton… - arXiv preprint arXiv …, 2024 - rivista.ai
Large Language Models (LLM) are increasingly trained on data generated by
other LLM, either because generated text and images become part of the pretraining corpus …

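Research on multimodal generative image data augmentation for 3D object detection

张光钱, 周广利, 黄飞, 刘文兵… - Journal of Chongqing University of Technology (Natural …, 2024 - clgzk.qks.cqut.edu.cn
To address the problem that traditional generative image data augmentation algorithms lose 3D attribute information
and therefore cannot be applied to 3D object detection tasks in autonomous driving, a multimodal image generation algorithm based on the Stable Diffusion model is proposed …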

Synthetic Data in AI-Driven Earth Observation: an Insight Into the SD4EO Project

M Fernández, J Gimeno, R Gini… - IGARSS 2024-2024 …, 2024 - ieeexplore.ieee.org
The "Physically-Based Synthetic Data for Earth Observation" (SD4EO) project, initiated in
October 2023, aims to integrate physically-based simulation data and artificial intelligence …

Data Augmentation Techniques Using Text-to-Image Diffusion Models for Enhanced Data Diversity

J Shin, H Jang - 2024 15th International Conference on …, 2024 - ieeexplore.ieee.org
Data augmentation is a widely used technique to enhance the performance of deep learning
models. However, traditional augmentation methods, dependent solely on original data …