Hide and Mine in Strings: Hardness, Algorithms, and Experiments

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the pro...

Bibliographic Details
Published in:	IEEE transactions on knowledge and data engineering 2023-06, Vol.35 (6), p.5948-5963
Main authors:	Bernardini, Giulia ; Conte, Alessio ; Gourdel, Garance ; Grossi, Roberto ; Loukides, Grigorios ; Pisanti, Nadia ; Pissis, Solon P. ; Punzi, Giulia ; Stougie, Leen ; Sweering, Michelle
Format:	Article
Language:	eng
Subjects:	Algorithms ; Bioinformatics ; Computer Science ; Context ; Data integrity ; Data mining ; Data privacy ; data sanitization ; Datasets ; DNA ; frequent pattern mining ; Genomics ; Greedy algorithms ; Hardness ; Integer programming ; knowledge hiding ; Linear programming ; Pattern analysis ; Polynomials ; Privacy ; Resists ; string algorithms ; Strings

container_end_page	5963
container_issue	6
container_start_page	5948
container_title	IEEE transactions on knowledge and data engineering
container_volume	35
creator	Bernardini, Giulia ; Conte, Alessio ; Gourdel, Garance ; Grossi, Roberto ; Loukides, Grigorios ; Pisanti, Nadia ; Pissis, Solon P. ; Punzi, Giulia ; Stougie, Leen ; Sweering, Michelle
description	Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.
doi_str_mv	10.1109/TKDE.2022.3158063
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_ieee_primary_9732522</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9732522</ieee_id><sourcerecordid>2808828729</sourcerecordid><originalsourceid>FETCH-LOGICAL-c322t-7b41f4c47d2015bd1b7167f587016f7be211009ae7fb65cf90e07ebec2dbd7143</originalsourceid><addsrcrecordid>eNo9kE1Lw0AQhhdRsFZ_gHgJeBJMndnNZjdepNRqxYoH63nJx6RNqUndTUX_vRtSepqX4Xnn42XsEmGECMnd4vVxOuLA-Uig1BCLIzZAKXXIMcFjryHCMBKROmVnzq0BQCuNA_YwqwoK0roI3qqagqoOPlpb1Ut3H8xSW9Tk3G0w3iwbW7WrL687dPq7JVt9Ud26c3ZSphtHF_s6ZJ9P08VkFs7fn18m43mYC87bUGURllEeqYIDyqzATGGsSqkVYFyqjLj_ApKUVJnFMi8TIFCUUc6LrFAYiSG76eeu0o3Z-uWp_TNNWpnZeG66HghQIGP4Qc9e9-zWNt87cq1ZNztb-_MM16A114onnsKeym3jnKXyMBbBdJmaLlPTZWr2mXrPVe-piOjAJ0pwybn4BylLb7c</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2808828729</pqid></control><display><type>article</type><title>Hide and Mine in Strings: Hardness, Algorithms, and Experiments</title><source>IEEE Electronic Library (IEL)</source><creator>Bernardini, Giulia ; Conte, Alessio ; Gourdel, Garance ; Grossi, Roberto ; Loukides, Grigorios ; Pisanti, Nadia ; Pissis, Solon P. ; Punzi, Giulia ; Stougie, Leen ; Sweering, Michelle</creator><creatorcontrib>Bernardini, Giulia ; Conte, Alessio ; Gourdel, Garance ; Grossi, Roberto ; Loukides, Grigorios ; Pisanti, Nadia ; Pissis, Solon P. ; Punzi, Giulia ; Stougie, Leen ; Sweering, Michelle</creatorcontrib><description>Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. 
Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. 
We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.</description><identifier>ISSN: 1041-4347</identifier><identifier>EISSN: 1558-2191</identifier><identifier>DOI: 10.1109/TKDE.2022.3158063</identifier><identifier>CODEN: ITKEEH</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Algorithms ; Bioinformatics ; Computer Science ; Context ; Data integrity ; Data mining ; Data privacy ; data sanitization ; Datasets ; DNA ; frequent pattern mining ; Genomics ; Greedy algorithms ; Hardness ; Integer programming ; knowledge hiding ; Linear programming ; Pattern analysis ; Polynomials ; Privacy ; Resists ; string algorithms ; Strings</subject><ispartof>IEEE transactions on knowledge and data engineering, 2023-06, Vol.35 (6), p.5948-5963</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023</rights><rights>Distributed under a Creative Commons Attribution 4.0 International License</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c322t-7b41f4c47d2015bd1b7167f587016f7be211009ae7fb65cf90e07ebec2dbd7143</cites><orcidid>0000-0003-1200-6015 ; 0000-0003-0888-5061 ; 0000-0002-1445-1932 ; 0000-0001-6647-088X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9732522$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>230,314,776,780,792,881,27903,27904,54737</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9732522$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://hal.science/hal-03070560$$DView record in HAL$$Hfree_for_read</backlink></links><search><creatorcontrib>Bernardini, 
Giulia</creatorcontrib><creatorcontrib>Conte, Alessio</creatorcontrib><creatorcontrib>Gourdel, Garance</creatorcontrib><creatorcontrib>Grossi, Roberto</creatorcontrib><creatorcontrib>Loukides, Grigorios</creatorcontrib><creatorcontrib>Pisanti, Nadia</creatorcontrib><creatorcontrib>Pissis, Solon P.</creatorcontrib><creatorcontrib>Punzi, Giulia</creatorcontrib><creatorcontrib>Stougie, Leen</creatorcontrib><creatorcontrib>Sweering, Michelle</creatorcontrib><title>Hide and Mine in Strings: Hardness, Algorithms, and Experiments</title><title>IEEE transactions on knowledge and data engineering</title><addtitle>TKDE</addtitle><description>Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. 
Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.</description><subject>Algorithms</subject><subject>Bioinformatics</subject><subject>Computer Science</subject><subject>Context</subject><subject>Data integrity</subject><subject>Data mining</subject><subject>Data privacy</subject><subject>data sanitization</subject><subject>Datasets</subject><subject>DNA</subject><subject>frequent pattern mining</subject><subject>Genomics</subject><subject>Greedy algorithms</subject><subject>Hardness</subject><subject>Integer programming</subject><subject>knowledge hiding</subject><subject>Linear programming</subject><subject>Pattern analysis</subject><subject>Polynomials</subject><subject>Privacy</subject><subject>Resists</subject><subject>string algorithms</subject><subject>Strings</subject><issn>1041-4347</issn><issn>1558-2191</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1Lw0AQhhdRsFZ_gHgJeBJMndnNZjdepNRqxYoH63nJx6RNqUndTUX_vRtSepqX4Xnn42XsEmGECMnd4vVxOuLA-Uig1BCLIzZAKXXIMcFjryHCMBKROmVnzq0BQCuNA_YwqwoK0roI3qqagqoOPlpb1Ut3H8xSW9Tk3G0w3iwbW7WrL687dPq7JVt9Ud26c3ZSphtHF_s6ZJ9P08VkFs7fn18m43mYC87bUGURllEeqYIDyqzATGGsSqkVYFyqjLj_ApKUVJnFMi8TIFCUUc6LrFAYiSG76eeu0o3Z-uWp_TNNWpnZeG66HghQIGP4Qc9e9-zWNt87cq1ZNztb-_MM16A114onnsKeym3jnKXyMBbBdJmaLlPTZWr2mXrPVe-piOjAJ0pwybn4BylLb7c</recordid><startdate>20230601</startdate><enddate>20230601</enddate><creator>Bernardini, Giulia</creator><creator>Conte, Alessio</creator><creator>Gourdel, 
Garance</creator><creator>Grossi, Roberto</creator><creator>Loukides, Grigorios</creator><creator>Pisanti, Nadia</creator><creator>Pissis, Solon P.</creator><creator>Punzi, Giulia</creator><creator>Stougie, Leen</creator><creator>Sweering, Michelle</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><general>Institute of Electrical and Electronics Engineers</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>1XC</scope><scope>VOOES</scope><orcidid>https://orcid.org/0000-0003-1200-6015</orcidid><orcidid>https://orcid.org/0000-0003-0888-5061</orcidid><orcidid>https://orcid.org/0000-0002-1445-1932</orcidid><orcidid>https://orcid.org/0000-0001-6647-088X</orcidid></search><sort><creationdate>20230601</creationdate><title>Hide and Mine in Strings: Hardness, Algorithms, and Experiments</title><author>Bernardini, Giulia ; Conte, Alessio ; Gourdel, Garance ; Grossi, Roberto ; Loukides, Grigorios ; Pisanti, Nadia ; Pissis, Solon P. 
; Punzi, Giulia ; Stougie, Leen ; Sweering, Michelle</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c322t-7b41f4c47d2015bd1b7167f587016f7be211009ae7fb65cf90e07ebec2dbd7143</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Bioinformatics</topic><topic>Computer Science</topic><topic>Context</topic><topic>Data integrity</topic><topic>Data mining</topic><topic>Data privacy</topic><topic>data sanitization</topic><topic>Datasets</topic><topic>DNA</topic><topic>frequent pattern mining</topic><topic>Genomics</topic><topic>Greedy algorithms</topic><topic>Hardness</topic><topic>Integer programming</topic><topic>knowledge hiding</topic><topic>Linear programming</topic><topic>Pattern analysis</topic><topic>Polynomials</topic><topic>Privacy</topic><topic>Resists</topic><topic>string algorithms</topic><topic>Strings</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bernardini, Giulia</creatorcontrib><creatorcontrib>Conte, Alessio</creatorcontrib><creatorcontrib>Gourdel, Garance</creatorcontrib><creatorcontrib>Grossi, Roberto</creatorcontrib><creatorcontrib>Loukides, Grigorios</creatorcontrib><creatorcontrib>Pisanti, Nadia</creatorcontrib><creatorcontrib>Pissis, Solon P.</creatorcontrib><creatorcontrib>Punzi, Giulia</creatorcontrib><creatorcontrib>Stougie, Leen</creatorcontrib><creatorcontrib>Sweering, Michelle</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science 
Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Hyper Article en Ligne (HAL)</collection><collection>Hyper Article en Ligne (HAL) (Open Access)</collection><jtitle>IEEE transactions on knowledge and data engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bernardini, Giulia</au><au>Conte, Alessio</au><au>Gourdel, Garance</au><au>Grossi, Roberto</au><au>Loukides, Grigorios</au><au>Pisanti, Nadia</au><au>Pissis, Solon P.</au><au>Punzi, Giulia</au><au>Stougie, Leen</au><au>Sweering, Michelle</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Hide and Mine in Strings: Hardness, Algorithms, and Experiments</atitle><jtitle>IEEE transactions on knowledge and data engineering</jtitle><stitle>TKDE</stitle><date>2023-06-01</date><risdate>2023</risdate><volume>35</volume><issue>6</issue><spage>5948</spage><epage>5963</epage><pages>5948-5963</pages><issn>1041-4347</issn><eissn>1558-2191</eissn><coden>ITKEEH</coden><abstract>Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. 
The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TKDE.2022.3158063</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0003-1200-6015</orcidid><orcidid>https://orcid.org/0000-0003-0888-5061</orcidid><orcidid>https://orcid.org/0000-0002-1445-1932</orcidid><orcidid>https://orcid.org/0000-0001-6647-088X</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1041-4347
ispartof	IEEE transactions on knowledge and data engineering, 2023-06, Vol.35 (6), p.5948-5963
issn	1041-4347 1558-2191
language	eng
recordid	cdi_ieee_primary_9732522
source	IEEE Electronic Library (IEL)
subjects	Algorithms ; Bioinformatics ; Computer Science ; Context ; Data integrity ; Data mining ; Data privacy ; data sanitization ; Datasets ; DNA ; frequent pattern mining ; Genomics ; Greedy algorithms ; Hardness ; Integer programming ; knowledge hiding ; Linear programming ; Pattern analysis ; Polynomials ; Privacy ; Resists ; string algorithms ; Strings
title	Hide and Mine in Strings: Hardness, Algorithms, and Experiments
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T05%3A23%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Hide%20and%20Mine%20in%20Strings:%20Hardness,%20Algorithms,%20and%20Experiments&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Bernardini,%20Giulia&rft.date=2023-06-01&rft.volume=35&rft.issue=6&rft.spage=5948&rft.epage=5963&rft.pages=5948-5963&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2022.3158063&rft_dat=%3Cproquest_RIE%3E2808828729%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2808828729&rft_id=info:pmid/&rft_ieee_id=9732522&rfr_iscdi=true