On the Limitation of Diffusion Models for Synthesizing Training Datasets
Synthetic samples from diffusion models are promising for leveraging in training discriminative models as replications of real training datasets. However, we found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models. Thi...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Synthetic samples from diffusion models are promising for leveraging in
training discriminative models as replications of real training datasets.
However, we found that the synthetic datasets degrade classification
performance over real datasets even when using state-of-the-art diffusion
models. This means that modern diffusion models do not perfectly represent the
data distribution for the purpose of replicating datasets for training
discriminative tasks. This paper investigates the gap between synthetic and
real samples by analyzing the synthetic samples reconstructed from real samples
through the diffusion and reverse process. By varying the time steps starting
the reverse process in the reconstruction, we can control the trade-off between
the information in the original real data and the information added by
diffusion models. Through assessing the reconstructed samples and trained
models, we found that the synthetic data are concentrated in modes of the
training data distribution as the reverse step increases, and thus, they are
difficult to cover the outer edges of the distribution. Our findings imply that
modern diffusion models are insufficient to replicate training data
distribution perfectly, and there is room for the improvement of generative
modeling in the replication of training datasets. |
---|---|
DOI: | 10.48550/arxiv.2311.13090 |