User-Driven Synthetic Dataset Generation with Quantifiable Differential Privacy

Recently, releasing data to a third party for secondary analysis has become a trend in service computing. However, data owners are concerned that such a move may expose individuals' records, in violation of regulations such as the European Union's General Data Protection Regulation. Differential privacy has been proposed as a possible solution to this problem. The privacy budget ε in differential privacy has a theoretical interpretation, but in practice its application to measuring the risk of data disclosure has not been well studied, especially for sampling-based synthetic datasets. Moreover, datasets released by data owners with quantifiable privacy levels, and the explicit utility of those datasets, have yet to be well developed. In this paper, we present an intuitive approach for defining the privacy level (i.e., data hit rate and k-level) and the utility level (i.e., basic statistics and a series of data mining models), and the privacy budget ε is quantified for evaluating the risk and utility of private data. In addition, we propose two user-driven synthetic dataset hunting methods that generate a synthetic dataset satisfying a specified privacy objective, enabling the data owner (e.g., governments and financial companies) to understand the possible privacy risk and thereby release datasets with a confirmed privacy level. To the best of our knowledge, this is the first method that allows data providers to automatically generate synthetic datasets with a quantifiable privacy level for the service of open data.
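
The record does not describe the generation algorithm itself, so the Python sketch below is only a hedged illustration of the ideas named in the abstract: perturb a summary of the original data with Laplace noise calibrated to the privacy budget ε, sample a candidate synthetic dataset from the perturbed summary, measure a data hit rate against the original records, and keep "hunting" until a candidate meets the user-specified privacy objective. Every function name, parameter, and threshold here (laplace_histogram, hunt_synthetic_dataset, tol, the 5% target hit rate) is an assumption made for illustration, not the authors' actual interface or definition.

import numpy as np

def laplace_histogram(data, bins, epsilon):
    # Perturb a count histogram with Laplace noise; the sensitivity of a
    # count query is 1, so the noise scale is 1 / epsilon.
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)                 # counts cannot be negative
    total = noisy.sum()
    probs = noisy / total if total > 0 else np.full(len(noisy), 1.0 / len(noisy))
    return probs, edges

def sample_synthetic(probs, edges, n):
    # Draw n synthetic values: pick a bin, then sample uniformly inside it.
    idx = np.random.choice(len(probs), size=n, p=probs)
    return np.random.uniform(edges[idx], edges[idx + 1])

def hit_rate(original, synthetic, tol):
    # Fraction of synthetic records falling within `tol` of some original
    # record -- a stand-in for the paper's notion of a data "hit".
    return np.mean([np.any(np.abs(original - s) <= tol) for s in synthetic])

def hunt_synthetic_dataset(original, epsilon, target_hit_rate,
                           tol=0.5, bins=20, max_trials=100):
    # Keep generating candidate synthetic datasets until one meets the
    # user-specified privacy objective (hit rate at or below the target).
    for trial in range(1, max_trials + 1):
        probs, edges = laplace_histogram(original, bins, epsilon)
        candidate = sample_synthetic(probs, edges, n=len(original))
        if hit_rate(original, candidate, tol) <= target_hit_rate:
            return candidate, trial
    return None, max_trials

# Toy usage: hunt for a synthetic version of 1,000 simulated ages.
original = np.random.normal(loc=40, scale=10, size=1000)
synthetic, trials = hunt_synthetic_dataset(original, epsilon=1.0,
                                           target_hit_rate=0.05)
print(f"accepted after {trials} trial(s)" if synthetic is not None
      else "no candidate met the privacy objective")

The paper additionally ties the release to an explicit utility level (basic statistics and data mining models), which this toy loop does not evaluate.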

Bibliographic Details
Published in: IEEE Transactions on Services Computing, 2023-09, Vol. 16 (5), p. 1-14
Authors: Tai, Bo-Chen; Tsou, Yao-Tung; Li, Szu-Chuang; Huang, Yennun; Tsai, Pei-Yuan; Tsai, Yu-Cheng
Format: Article
Language: English
DOI: 10.1109/TSC.2023.3287239
ISSN: 1939-1374
EISSN: 2372-0204
Source: IEEE Electronic Library (IEL)
subjects <inline-formula xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <tex-math notation="LaTeX"> k</tex-math> </inline-formula>-level
Budgets
data hit rate
Data mining
Data models
Data privacy
Data protection
Datasets
Differential privacy
Distributed databases
Government
Privacy
Risk
Synthetic data
synthetic dataset hunting