ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Recent research in tabular data synthesis has focused on single tables,
whereas real-world applications often involve complex data with tens or
hundreds of interconnected tables. Previous approaches to synthesizing
multi-relational (multi-table) data fall short in two key aspects: scalability
for larger datasets and capturing long-range dependencies, such as correlations
between attributes spread across different tables. Inspired by the success of
diffusion models in tabular data modeling, we introduce
Cluster Latent Variable guided Denoising Diffusion Probabilistic Models
(ClavaDDPM). This novel approach leverages clustering labels
as intermediaries to model relationships between tables, specifically focusing
on foreign key constraints. ClavaDDPM leverages the robust generation
capabilities of diffusion models while incorporating efficient algorithms to
propagate the learned latent variables across tables. This enables ClavaDDPM to
capture long-range dependencies effectively.
Extensive evaluations on multi-table datasets of varying sizes show that
ClavaDDPM significantly outperforms existing methods for these long-range
dependencies while remaining competitive on utility metrics for single-table
data. |
---|---|
DOI: | 10.48550/arxiv.2405.17724 |