Discovering Conditional Functional Dependencies

This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expen...

Detailed description

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering 2011-05, Vol.23 (5), p.683-698
Main authors: Wenfei Fan, Geerts, F, Jianzhong Li, Ming Xiong
Format: Article
Language: eng
container_end_page 698
container_issue 5
container_start_page 683
container_title IEEE transactions on knowledge and data engineering
container_volume 23
creator Wenfei Fan
Geerts, F
Jianzhong Li
Ming Xiong
description This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.
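The description above characterizes a CFD as an FD extended with a pattern of constants. As a rough illustration only (not the paper's CFDMiner, CTANE, or FastCFD algorithms), the following minimal sketch checks whether a small relation satisfies a single CFD. The function name `satisfies_cfd`, the `"_"` wildcard convention, and the sample `CC`/`AC`/`city` rows are assumptions made for this example and do not come from the record.

```python
from typing import Dict, List

# Hypothetical convention for this sketch: "_" in a pattern matches any value.
WILDCARD = "_"

Row = Dict[str, str]


def matches(value: str, pattern_value: str) -> bool:
    """A value matches a pattern entry if the entry is a wildcard or equal."""
    return pattern_value == WILDCARD or value == pattern_value


def satisfies_cfd(rows: List[Row], lhs: List[str], rhs: str, pattern: Row) -> bool:
    """Check the CFD (lhs -> rhs, pattern) against the given rows.

    Rows whose lhs values match the pattern must agree on rhs whenever
    they agree on lhs, and their rhs value must match the pattern as well.
    """
    seen: Dict[tuple, str] = {}
    for row in rows:
        # The CFD only constrains rows that match the left-hand pattern.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        # The right-hand value must match the right-hand pattern entry.
        if not matches(row[rhs], pattern[rhs]):
            return False
        # Rows with equal lhs values must carry the same rhs value.
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Made-up sample relation (country code, area code, city).
ROWS: List[Row] = [
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "01", "AC": "908", "city": "MH"},
    {"CC": "01", "AC": "908", "city": "NYC"},
]

# Constant CFD: every pattern entry is a constant.
CONSTANT_CFD: Row = {"CC": "44", "AC": "131", "city": "EDI"}
# General CFD: constants mixed with wildcards.
VARIABLE_CFD: Row = {"CC": "01", "AC": "_", "city": "_"}

if __name__ == "__main__":
    print(satisfies_cfd(ROWS, ["CC", "AC"], "city", CONSTANT_CFD))
    print(satisfies_cfd(ROWS, ["CC", "AC"], "city", VARIABLE_CFD))
```

In this sketch the constant pattern holds on the sample rows, while the pattern with wildcards fails because two rows share `CC` and `AC` but differ in `city`. Discovery, as described in the abstract, is the harder inverse task of finding such patterns from the data rather than checking a given one.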
doi_str_mv 10.1109/TKDE.2010.154
format Article
fullrecord <record><control><sourceid>crossref_RIE</sourceid><recordid>TN_cdi_ieee_primary_5560658</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5560658</ieee_id><sourcerecordid>10_1109_TKDE_2010_154</sourcerecordid><originalsourceid>FETCH-LOGICAL-c296t-28d26101fe8a6aec99523ecb64abe4047dc11c753add9816b4cfc776d605ebff3</originalsourceid><addsrcrecordid>eNo9j01LAzEURYMoWKtLV276B9LmZfI1S5lpVSy4qeuQSV4kUjNlUgX_vTO0uLr3wuHCIeQe2BKA1avda7tecjZNKS7IDKQ0lEMNl2NnAqiohL4mN6V8MsaMNjAjqzYV3__gkPLHoulzSMfUZ7dfbL6zP9cWD5gDZp-w3JKr6PYF7845J--b9a55ptu3p5fmcUs9r9WRchO4AgYRjVMOfV1LXqHvlHAdCiZ08ABey8qFUBtQnfDRa62CYhK7GKs5oadfP_SlDBjtYUhfbvi1wOxkaydbO9na0XbkH058QsR_VkrFlDTVH-rVUOI</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Discovering Conditional Functional Dependencies</title><source>IEEE Xplore</source><creator>Wenfei Fan ; Geerts, F ; Jianzhong Li ; Ming Xiong</creator><creatorcontrib>Wenfei Fan ; Geerts, F ; Jianzhong Li ; Ming Xiong</creatorcontrib><description>This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. 
Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.</description><identifier>ISSN: 1041-4347</identifier><identifier>EISSN: 1558-2191</identifier><identifier>DOI: 10.1109/TKDE.2010.154</identifier><identifier>CODEN: ITKEEH</identifier><language>eng</language><publisher>IEEE</publisher><subject>Association rules ; Cleaning ; closed item set ; Computational fluid dynamics ; conditional functional dependency ; free item set ; functional dependency ; Integrity ; Itemsets ; Pattern matching ; Semantics</subject><ispartof>IEEE transactions on knowledge and data engineering, 2011-05, Vol.23 (5), 
p.683-698</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c296t-28d26101fe8a6aec99523ecb64abe4047dc11c753add9816b4cfc776d605ebff3</citedby><cites>FETCH-LOGICAL-c296t-28d26101fe8a6aec99523ecb64abe4047dc11c753add9816b4cfc776d605ebff3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5560658$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5560658$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Wenfei Fan</creatorcontrib><creatorcontrib>Geerts, F</creatorcontrib><creatorcontrib>Jianzhong Li</creatorcontrib><creatorcontrib>Ming Xiong</creatorcontrib><title>Discovering Conditional Functional Dependencies</title><title>IEEE transactions on knowledge and data engineering</title><addtitle>TKDE</addtitle><description>This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. 
Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.</description><subject>Association rules</subject><subject>Cleaning</subject><subject>closed item set</subject><subject>Computational fluid dynamics</subject><subject>conditional functional dependency</subject><subject>free item set</subject><subject>functional dependency</subject><subject>Integrity</subject><subject>Itemsets</subject><subject>Pattern 
matching</subject><subject>Semantics</subject><issn>1041-4347</issn><issn>1558-2191</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9j01LAzEURYMoWKtLV276B9LmZfI1S5lpVSy4qeuQSV4kUjNlUgX_vTO0uLr3wuHCIeQe2BKA1avda7tecjZNKS7IDKQ0lEMNl2NnAqiohL4mN6V8MsaMNjAjqzYV3__gkPLHoulzSMfUZ7dfbL6zP9cWD5gDZp-w3JKr6PYF7845J--b9a55ptu3p5fmcUs9r9WRchO4AgYRjVMOfV1LXqHvlHAdCiZ08ABey8qFUBtQnfDRa62CYhK7GKs5oadfP_SlDBjtYUhfbvi1wOxkaydbO9na0XbkH058QsR_VkrFlDTVH-rVUOI</recordid><startdate>20110501</startdate><enddate>20110501</enddate><creator>Wenfei Fan</creator><creator>Geerts, F</creator><creator>Jianzhong Li</creator><creator>Ming Xiong</creator><general>IEEE</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20110501</creationdate><title>Discovering Conditional Functional Dependencies</title><author>Wenfei Fan ; Geerts, F ; Jianzhong Li ; Ming Xiong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c296t-28d26101fe8a6aec99523ecb64abe4047dc11c753add9816b4cfc776d605ebff3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Association rules</topic><topic>Cleaning</topic><topic>closed item set</topic><topic>Computational fluid dynamics</topic><topic>conditional functional dependency</topic><topic>free item set</topic><topic>functional dependency</topic><topic>Integrity</topic><topic>Itemsets</topic><topic>Pattern matching</topic><topic>Semantics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wenfei Fan</creatorcontrib><creatorcontrib>Geerts, F</creatorcontrib><creatorcontrib>Jianzhong Li</creatorcontrib><creatorcontrib>Ming Xiong</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005–Present</collection><collection>IEEE 
All-Society Periodicals Package (ASPP) 1998–Present</collection><collection>IEEE Xplore</collection><collection>CrossRef</collection><jtitle>IEEE transactions on knowledge and data engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Wenfei Fan</au><au>Geerts, F</au><au>Jianzhong Li</au><au>Ming Xiong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Discovering Conditional Functional Dependencies</atitle><jtitle>IEEE transactions on knowledge and data engineering</jtitle><stitle>TKDE</stitle><date>2011-05-01</date><risdate>2011</risdate><volume>23</volume><issue>5</issue><spage>683</spage><epage>698</epage><pages>683-698</pages><issn>1041-4347</issn><eissn>1558-2191</eissn><coden>ITKEEH</coden><abstract>This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. 
The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.</abstract><pub>IEEE</pub><doi>10.1109/TKDE.2010.154</doi><tpages>16</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2011-05, Vol.23 (5), p.683-698
issn 1041-4347
1558-2191
language eng
recordid cdi_ieee_primary_5560658
source IEEE Xplore
subjects Association rules
Cleaning
closed item set
Computational fluid dynamics
conditional functional dependency
free item set
functional dependency
Integrity
Itemsets
Pattern matching
Semantics
title Discovering Conditional Functional Dependencies
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T06%3A09%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Discovering%20Conditional%20Functional%20Dependencies&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Wenfei%20Fan&rft.date=2011-05-01&rft.volume=23&rft.issue=5&rft.spage=683&rft.epage=698&rft.pages=683-698&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2010.154&rft_dat=%3Ccrossref_RIE%3E10_1109_TKDE_2010_154%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5560658&rfr_iscdi=true