ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models

Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for la...

Full description

Bibliographic details
Main authors: Pang, Wei; Shafieinejad, Masoumeh; Liu, Lucy; Hazlewood, Stephanie; He, Xi
Format: Article
Language: eng
Subjects: Computer Science - Artificial Intelligence
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Pang, Wei
Shafieinejad, Masoumeh
Liu, Lucy
Hazlewood, Stephanie
He, Xi
description Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
doi_str_mv 10.48550/arxiv.2405.17724
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2405_17724</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2405_17724</sourcerecordid><originalsourceid>FETCH-LOGICAL-a674-9bc2f1fb2d3d5c936151534cd7534edfab8c79b7453d311d7b33f1c0452e5b1d3</originalsourceid><addsrcrecordid>eNotz8tOwzAUBFBvWKDCB7DCP-AQv-qGHUp4SY2KRPfRta9NLZkWxU6hf09p2cxsRiMdQm54XamF1vUdjD9xXwlV64obI9QlWbUJ9tB1b_097adUIht9ghJ3W0i0gwL0_bAtG59jpt-xbGibplz8yD6miB5pF0OY8nFO-x36lK_IRYCU_fV_z8j66XHdvrDl6vm1fVgymBvFGutE4MEKlKhdI-dccy2VQ3NMjwHswpnGGqUlSs7RWCkDd7XSwmvLUc7I7fn2JBq-xvgJ42H4kw0nmfwFu9RJEA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</title><source>arXiv.org</source><creator>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</creator><creatorcontrib>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</creatorcontrib><description>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. 
ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</description><identifier>DOI: 10.48550/arxiv.2405.17724</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence</subject><creationdate>2024-05</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2405.17724$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2405.17724$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Pang, Wei</creatorcontrib><creatorcontrib>Shafieinejad, Masoumeh</creatorcontrib><creatorcontrib>Liu, Lucy</creatorcontrib><creatorcontrib>Hazlewood, Stephanie</creatorcontrib><creatorcontrib>He, Xi</creatorcontrib><title>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</title><description>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. 
Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</description><subject>Computer Science - Artificial Intelligence</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz8tOwzAUBFBvWKDCB7DCP-AQv-qGHUp4SY2KRPfRta9NLZkWxU6hf09p2cxsRiMdQm54XamF1vUdjD9xXwlV64obI9QlWbUJ9tB1b_097adUIht9ghJ3W0i0gwL0_bAtG59jpt-xbGibplz8yD6miB5pF0OY8nFO-x36lK_IRYCU_fV_z8j66XHdvrDl6vm1fVgymBvFGutE4MEKlKhdI-dccy2VQ3NMjwHswpnGGqUlSs7RWCkDd7XSwmvLUc7I7fn2JBq-xvgJ42H4kw0nmfwFu9RJEA</recordid><startdate>20240527</startdate><enddate>20240527</enddate><creator>Pang, Wei</creator><creator>Shafieinejad, Masoumeh</creator><creator>Liu, Lucy</creator><creator>Hazlewood, Stephanie</creator><creator>He, Xi</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240527</creationdate><title>ClavaDDPM: Multi-relational Data 
Synthesis with Cluster-guided Diffusion Models</title><author>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a674-9bc2f1fb2d3d5c936151534cd7534edfab8c79b7453d311d7b33f1c0452e5b1d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><toplevel>online_resources</toplevel><creatorcontrib>Pang, Wei</creatorcontrib><creatorcontrib>Shafieinejad, Masoumeh</creatorcontrib><creatorcontrib>Liu, Lucy</creatorcontrib><creatorcontrib>Hazlewood, Stephanie</creatorcontrib><creatorcontrib>He, Xi</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Pang, Wei</au><au>Shafieinejad, Masoumeh</au><au>Liu, Lucy</au><au>Hazlewood, Stephanie</au><au>He, Xi</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</atitle><date>2024-05-27</date><risdate>2024</risdate><abstract>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). 
This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</abstract><doi>10.48550/arxiv.2405.17724</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2405.17724
ispartof
issn
language eng
recordid cdi_arxiv_primary_2405_17724
source arXiv.org
subjects Computer Science - Artificial Intelligence
title ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T05%3A43%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=ClavaDDPM:%20Multi-relational%20Data%20Synthesis%20with%20Cluster-guided%20Diffusion%20Models&rft.au=Pang,%20Wei&rft.date=2024-05-27&rft_id=info:doi/10.48550/arxiv.2405.17724&rft_dat=%3Carxiv_GOX%3E2405_17724%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true