ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models

Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for la...

Bibliographic details
Main authors:	Pang, Wei; Shafieinejad, Masoumeh; Liu, Lucy; Hazlewood, Stephanie; He, Xi
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Pang, Wei; Shafieinejad, Masoumeh; Liu, Lucy; Hazlewood, Stephanie; He, Xi
description	Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
doi_str_mv	10.48550/arxiv.2405.17724
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2405_17724</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2405_17724</sourcerecordid><originalsourceid>FETCH-LOGICAL-a674-9bc2f1fb2d3d5c936151534cd7534edfab8c79b7453d311d7b33f1c0452e5b1d3</originalsourceid><addsrcrecordid>eNotz8tOwzAUBFBvWKDCB7DCP-AQv-qGHUp4SY2KRPfRta9NLZkWxU6hf09p2cxsRiMdQm54XamF1vUdjD9xXwlV64obI9QlWbUJ9tB1b_097adUIht9ghJ3W0i0gwL0_bAtG59jpt-xbGibplz8yD6miB5pF0OY8nFO-x36lK_IRYCU_fV_z8j66XHdvrDl6vm1fVgymBvFGutE4MEKlKhdI-dccy2VQ3NMjwHswpnGGqUlSs7RWCkDd7XSwmvLUc7I7fn2JBq-xvgJ42H4kw0nmfwFu9RJEA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</title><source>arXiv.org</source><creator>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</creator><creatorcontrib>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</creatorcontrib><description>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. 
ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</description><identifier>DOI: 10.48550/arxiv.2405.17724</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence</subject><creationdate>2024-05</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2405.17724$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2405.17724$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Pang, Wei</creatorcontrib><creatorcontrib>Shafieinejad, Masoumeh</creatorcontrib><creatorcontrib>Liu, Lucy</creatorcontrib><creatorcontrib>Hazlewood, Stephanie</creatorcontrib><creatorcontrib>He, Xi</creatorcontrib><title>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</title><description>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. 
Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</description><subject>Computer Science - Artificial Intelligence</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz8tOwzAUBFBvWKDCB7DCP-AQv-qGHUp4SY2KRPfRta9NLZkWxU6hf09p2cxsRiMdQm54XamF1vUdjD9xXwlV64obI9QlWbUJ9tB1b_097adUIht9ghJ3W0i0gwL0_bAtG59jpt-xbGibplz8yD6miB5pF0OY8nFO-x36lK_IRYCU_fV_z8j66XHdvrDl6vm1fVgymBvFGutE4MEKlKhdI-dccy2VQ3NMjwHswpnGGqUlSs7RWCkDd7XSwmvLUc7I7fn2JBq-xvgJ42H4kw0nmfwFu9RJEA</recordid><startdate>20240527</startdate><enddate>20240527</enddate><creator>Pang, Wei</creator><creator>Shafieinejad, Masoumeh</creator><creator>Liu, Lucy</creator><creator>Hazlewood, Stephanie</creator><creator>He, Xi</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240527</creationdate><title>ClavaDDPM: Multi-relational Data 
Synthesis with Cluster-guided Diffusion Models</title><author>Pang, Wei ; Shafieinejad, Masoumeh ; Liu, Lucy ; Hazlewood, Stephanie ; He, Xi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a674-9bc2f1fb2d3d5c936151534cd7534edfab8c79b7453d311d7b33f1c0452e5b1d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><toplevel>online_resources</toplevel><creatorcontrib>Pang, Wei</creatorcontrib><creatorcontrib>Shafieinejad, Masoumeh</creatorcontrib><creatorcontrib>Liu, Lucy</creatorcontrib><creatorcontrib>Hazlewood, Stephanie</creatorcontrib><creatorcontrib>He, Xi</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Pang, Wei</au><au>Shafieinejad, Masoumeh</au><au>Liu, Lucy</au><au>Hazlewood, Stephanie</au><au>He, Xi</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models</atitle><date>2024-05-27</date><risdate>2024</risdate><abstract>Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). 
This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.</abstract><doi>10.48550/arxiv.2405.17724</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2405.17724
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2405_17724
source	arXiv.org
subjects	Computer Science - Artificial Intelligence
title	ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T05%3A43%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=ClavaDDPM:%20Multi-relational%20Data%20Synthesis%20with%20Cluster-guided%20Diffusion%20Models&rft.au=Pang,%20Wei&rft.date=2024-05-27&rft_id=info:doi/10.48550/arxiv.2405.17724&rft_dat=%3Carxiv_GOX%3E2405_17724%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true