Trading Off Scalability, Privacy, and Performance in Data Synthesis

Synthetic data has recently been widely applied in the real world. One typical example is the creation of synthetic data for privacy-sensitive datasets. In this scenario, synthetic data substitutes for the real data, which contains private information, and is used for public testing of machine learnin...

Detailed description

Bibliographic details
Published in: IEEE access 2024, Vol. 12, p. 26642-26654
Main authors: Ling, Xiao; Menzies, Tim; Hazard, Christopher; Shu, Jack; Beel, Jacob
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 26654
container_issue
container_start_page 26642
container_title IEEE access
container_volume 12
creator Ling, Xiao
Menzies, Tim
Hazard, Christopher
Shu, Jack
Beel, Jacob
description Synthetic data has recently been widely applied in the real world. One typical example is the creation of synthetic data for privacy-sensitive datasets. In this scenario, synthetic data substitutes for the real data, which contains private information, and is used for public testing of machine learning models. Another typical example is over-sampling of imbalanced data, in which synthetic data is generated in the region of minority samples to balance the ratio of positive to negative samples when training machine learning models. In this study, we concentrate on the first example and introduce (a) the Howso engine and (b) our proposed random-projection-based synthetic data generation framework. We evaluate these two algorithms on privacy preservation and accuracy, and compare them to two state-of-the-art synthetic data generation algorithms, DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score. Our proposed random-projection-based framework, on the other hand, generates synthetic data with the highest accuracy score and scales the fastest.
doi_str_mv 10.1109/ACCESS.2024.3366556
format Article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_10438420</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10438420</ieee_id><doaj_id>oai_doaj_org_article_7cc6917baf9b47ff96051e88100d1ae9</doaj_id><sourcerecordid>2930957791</sourcerecordid><originalsourceid>FETCH-LOGICAL-c359t-e4f9599c7ae9932ff3aab05bf699887dd6c660270b688f0903305a02025c8d3a3</originalsourceid><addsrcrecordid>eNpNkE1rwzAMhsPYYKXrL9gOgV3XTo5jxz6WrPuAQgvpzkZx7M6lTTonHfTfz13KqC4SQu8r6YmiewITQkA-T_N8VhSTBJJ0QinnjPGraJAQLseUUX59Ud9Go7bdQAgRWiwbRPnKY-XqdbywNi40brF0W9cdn-Kldz-oQ4F1FS-Nt43fYa1N7Or4BTuMi2PdfZnWtXfRjcVta0bnPIw-X2er_H08X7x95NP5WFMmu7FJrWRS6gyNlDSxliKWwErLpRQiqyquOYckg5ILYUECpcAQwl9Mi4oiHUYfvW_V4EbtvduhP6oGnfprNH6t0HdOb43KtOaSZCVaWaaZtZIDI0YIAlCRsD94PfZee998H0zbqU1z8HU4XyWSQmCTSRKmaD-lfdO23tj_rQTUCb7q4asTfHWGH1QPvcoZYy4UKRVpAvQXw5h-Cw</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2930957791</pqid></control><display><type>article</type><title>Trading Off Scalability, Privacy, and Performance in Data Synthesis</title><source>Directory of Open Access Journals</source><source>IEEE Xplore Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Ling, Xiao ; Menzies, Tim ; Hazard, Christopher ; Shu, Jack ; Beel, Jacob</creator><creatorcontrib>Ling, Xiao ; Menzies, Tim ; Hazard, Christopher ; Shu, Jack ; Beel, Jacob</creatorcontrib><description>Synthetic data has been widely applied in the real world recently. One typical example is the creation of synthetic data for privacy concerned datasets. In this scenario, synthetic data substitute the real data which contains the privacy information, and is used to public testing for machine learning models. 
Another typical example is the unbalance data over-sampling which the synthetic data is generated in the region of minority samples to balance the positive and negative ratio when training the machine learning models. In this study, we concentrate on the first example, and introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework. We evaluate these two algorithms on the aspects of privacy preservation and accuracy, and compare them to the two state-of-the-art synthetic data generation algorithms DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random projection based framework can generate synthetic data with highest accuracy score, and has the fastest scalability.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2024.3366556</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Accuracy ; Algorithms ; Biomedical imaging ; classification ; Classification algorithms ; Clustering algorithms ; Data models ; Data privacy ; Engines ; Generative adversarial networks ; Homomorphic encryption ; Machine learning ; Privacy ; privacy preservation ; regression ; Regression analysis ; Scalability ; Synthetic data ; Synthetic data generation</subject><ispartof>IEEE access, 2024, Vol.12, p.26642-26654</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c359t-e4f9599c7ae9932ff3aab05bf699887dd6c660270b688f0903305a02025c8d3a3</cites><orcidid>0000-0002-5040-3196 ; 0000-0002-1398-9319</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10438420$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2096,4010,27610,27900,27901,27902,54908</link.rule.ids></links><search><creatorcontrib>Ling, Xiao</creatorcontrib><creatorcontrib>Menzies, Tim</creatorcontrib><creatorcontrib>Hazard, Christopher</creatorcontrib><creatorcontrib>Shu, Jack</creatorcontrib><creatorcontrib>Beel, Jacob</creatorcontrib><title>Trading Off Scalability, Privacy, and Performance in Data Synthesis</title><title>IEEE access</title><addtitle>Access</addtitle><description>Synthetic data has been widely applied in the real world recently. One typical example is the creation of synthetic data for privacy concerned datasets. In this scenario, synthetic data substitute the real data which contains the privacy information, and is used to public testing for machine learning models. Another typical example is the unbalance data over-sampling which the synthetic data is generated in the region of minority samples to balance the positive and negative ratio when training the machine learning models. In this study, we concentrate on the first example, and introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework. We evaluate these two algorithms on the aspects of privacy preservation and accuracy, and compare them to the two state-of-the-art synthetic data generation algorithms DataSynthesizer and Synthetic Data Vault. 
We show that the synthetic data generated by Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random projection based framework can generate synthetic data with highest accuracy score, and has the fastest scalability.</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Biomedical imaging</subject><subject>classification</subject><subject>Classification algorithms</subject><subject>Clustering algorithms</subject><subject>Data models</subject><subject>Data privacy</subject><subject>Engines</subject><subject>Generative adversarial networks</subject><subject>Homomorphic encryption</subject><subject>Machine learning</subject><subject>Privacy</subject><subject>privacy preservation</subject><subject>regression</subject><subject>Regression analysis</subject><subject>Scalability</subject><subject>Synthetic data</subject><subject>Synthetic data generation</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNkE1rwzAMhsPYYKXrL9gOgV3XTo5jxz6WrPuAQgvpzkZx7M6lTTonHfTfz13KqC4SQu8r6YmiewITQkA-T_N8VhSTBJJ0QinnjPGraJAQLseUUX59Ud9Go7bdQAgRWiwbRPnKY-XqdbywNi40brF0W9cdn-Kldz-oQ4F1FS-Nt43fYa1N7Or4BTuMi2PdfZnWtXfRjcVta0bnPIw-X2er_H08X7x95NP5WFMmu7FJrWRS6gyNlDSxliKWwErLpRQiqyquOYckg5ILYUECpcAQwl9Mi4oiHUYfvW_V4EbtvduhP6oGnfprNH6t0HdOb43KtOaSZCVaWaaZtZIDI0YIAlCRsD94PfZee998H0zbqU1z8HU4XyWSQmCTSRKmaD-lfdO23tj_rQTUCb7q4asTfHWGH1QPvcoZYy4UKRVpAvQXw5h-Cw</recordid><startdate>2024</startdate><enddate>2024</enddate><creator>Ling, Xiao</creator><creator>Menzies, Tim</creator><creator>Hazard, Christopher</creator><creator>Shu, Jack</creator><creator>Beel, Jacob</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-5040-3196</orcidid><orcidid>https://orcid.org/0000-0002-1398-9319</orcidid></search><sort><creationdate>2024</creationdate><title>Trading Off Scalability, Privacy, and Performance in Data Synthesis</title><author>Ling, Xiao ; Menzies, Tim ; Hazard, Christopher ; Shu, Jack ; Beel, Jacob</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c359t-e4f9599c7ae9932ff3aab05bf699887dd6c660270b688f0903305a02025c8d3a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Biomedical imaging</topic><topic>classification</topic><topic>Classification algorithms</topic><topic>Clustering algorithms</topic><topic>Data models</topic><topic>Data privacy</topic><topic>Engines</topic><topic>Generative adversarial networks</topic><topic>Homomorphic encryption</topic><topic>Machine learning</topic><topic>Privacy</topic><topic>privacy preservation</topic><topic>regression</topic><topic>Regression analysis</topic><topic>Scalability</topic><topic>Synthetic data</topic><topic>Synthetic data generation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ling, Xiao</creatorcontrib><creatorcontrib>Menzies, Tim</creatorcontrib><creatorcontrib>Hazard, Christopher</creatorcontrib><creatorcontrib>Shu, Jack</creatorcontrib><creatorcontrib>Beel, Jacob</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals 
Package (ASPP) 1998–Present</collection><collection>IEEE Xplore</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ling, Xiao</au><au>Menzies, Tim</au><au>Hazard, Christopher</au><au>Shu, Jack</au><au>Beel, Jacob</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Trading Off Scalability, Privacy, and Performance in Data Synthesis</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2024</date><risdate>2024</risdate><volume>12</volume><spage>26642</spage><epage>26654</epage><pages>26642-26654</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Synthetic data has been widely applied in the real world recently. One typical example is the creation of synthetic data for privacy concerned datasets. In this scenario, synthetic data substitute the real data which contains the privacy information, and is used to public testing for machine learning models. Another typical example is the unbalance data over-sampling which the synthetic data is generated in the region of minority samples to balance the positive and negative ratio when training the machine learning models. 
In this study, we concentrate on the first example, and introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework. We evaluate these two algorithms on the aspects of privacy preservation and accuracy, and compare them to the two state-of-the-art synthetic data generation algorithms DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random projection based framework can generate synthetic data with highest accuracy score, and has the fastest scalability.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2024.3366556</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-5040-3196</orcidid><orcidid>https://orcid.org/0000-0002-1398-9319</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2024, Vol.12, p.26642-26654
issn 2169-3536
2169-3536
language eng
recordid cdi_ieee_primary_10438420
source Directory of Open Access Journals; IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects Accuracy
Algorithms
Biomedical imaging
classification
Classification algorithms
Clustering algorithms
Data models
Data privacy
Engines
Generative adversarial networks
Homomorphic encryption
Machine learning
Privacy
privacy preservation
regression
Regression analysis
Scalability
Synthetic data
Synthetic data generation
title Trading Off Scalability, Privacy, and Performance in Data Synthesis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-11T16%3A05%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Trading%20Off%20Scalability,%20Privacy,%20and%20Performance%20in%20Data%20Synthesis&rft.jtitle=IEEE%20access&rft.au=Ling,%20Xiao&rft.date=2024&rft.volume=12&rft.spage=26642&rft.epage=26654&rft.pages=26642-26654&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2024.3366556&rft_dat=%3Cproquest_ieee_%3E2930957791%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2930957791&rft_id=info:pmid/&rft_ieee_id=10438420&rft_doaj_id=oai_doaj_org_article_7cc6917baf9b47ff96051e88100d1ae9&rfr_iscdi=true