Distributed error estimation of functional dependency

Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to o...

Bibliographic details
Published in: Information sciences 2016-06, Vol.345, p.156-176
Main authors: Jin, Cheqing; Lall, Ashwin; (Jim) Xu, Jun; Zhang, Zhigang; Zhou, Aoying
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 176
container_issue
container_start_page 156
container_title Information sciences
container_volume 345
creator Jin, Cheqing
Lall, Ashwin
(Jim) Xu, Jun
Zhang, Zhigang
Zhou, Aoying
description Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries—that are dual to each other in semantics—for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.
doi_str_mv 10.1016/j.ins.2016.01.051
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1793278640</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0020025516000852</els_id><sourcerecordid>1793278640</sourcerecordid><originalsourceid>FETCH-LOGICAL-c325t-9c980bf3ca61386f14a6b2bd55356ddbc939c9345ffcab18fe280f3164d57c3d3</originalsourceid><addsrcrecordid>eNp9kL1OxDAQhC0EEsfBA9ClpEnYtWMnERU6fqWTaKC2EnstOcolh50g3dvj01FT7RQzo52PsVuEAgHVfV_4MRY8yQKwAIlnbIV1xXPFGzxnKwAOOXApL9lVjD0AlJVSKyaffJyD75aZbEYhTCGjOPtdO_tpzCaXuWU0R90OmaU9jZZGc7hmF64dIt383TX7enn-3Lzl24_X983jNjeCyzlvTFND54RpFYpaOSxb1fHOSimksrYzjUgWUUrnTNth7YjX4ASq0srKCCvW7O7Uuw_T95Ie0zsfDQ1DO9K0RI1VI3hVqxKSFU9WE6YYAzm9D2lGOGgEfUSke50Q6SMiDagTopR5OGUobfjxFHQ0Pu0j6wOZWdvJ_5P-BZCJbvw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1793278640</pqid></control><display><type>article</type><title>Distributed error estimation of functional dependency</title><source>ScienceDirect Journals (5 years ago - present)</source><creator>Jin, Cheqing ; Lall, Ashwin ; (Jim) Xu, Jun ; Zhang, Zhigang ; Zhou, Aoying</creator><creatorcontrib>Jin, Cheqing ; Lall, Ashwin ; (Jim) Xu, Jun ; Zhang, Zhigang ; Zhou, Aoying</creatorcontrib><description>Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. 
In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries—that are dual to each other in semantics—for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.</description><identifier>ISSN: 0020-0255</identifier><identifier>EISSN: 1872-6291</identifier><identifier>DOI: 10.1016/j.ins.2016.01.051</identifier><language>eng</language><publisher>Elsevier Inc</publisher><subject>Algorithms ; Cost engineering ; Data structures ; Distributed processing ; Effectiveness ; Error analysis ; Error estimation ; Errors ; Functional dependency ; Scanning ; Semantics</subject><ispartof>Information sciences, 2016-06, Vol.345, p.156-176</ispartof><rights>2016 Elsevier 
Inc.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c325t-9c980bf3ca61386f14a6b2bd55356ddbc939c9345ffcab18fe280f3164d57c3d3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ins.2016.01.051$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3548,27922,27923,45993</link.rule.ids></links><search><creatorcontrib>Jin, Cheqing</creatorcontrib><creatorcontrib>Lall, Ashwin</creatorcontrib><creatorcontrib>(Jim) Xu, Jun</creatorcontrib><creatorcontrib>Zhang, Zhigang</creatorcontrib><creatorcontrib>Zhou, Aoying</creatorcontrib><title>Distributed error estimation of functional dependency</title><title>Information sciences</title><description>Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries—that are dual to each other in semantics—for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. 
In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.</description><subject>Algorithms</subject><subject>Cost engineering</subject><subject>Data structures</subject><subject>Distributed processing</subject><subject>Effectiveness</subject><subject>Error analysis</subject><subject>Error estimation</subject><subject>Errors</subject><subject>Functional dependency</subject><subject>Scanning</subject><subject>Semantics</subject><issn>0020-0255</issn><issn>1872-6291</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><recordid>eNp9kL1OxDAQhC0EEsfBA9ClpEnYtWMnERU6fqWTaKC2EnstOcolh50g3dvj01FT7RQzo52PsVuEAgHVfV_4MRY8yQKwAIlnbIV1xXPFGzxnKwAOOXApL9lVjD0AlJVSKyaffJyD75aZbEYhTCGjOPtdO_tpzCaXuWU0R90OmaU9jZZGc7hmF64dIt383TX7enn-3Lzl24_X983jNjeCyzlvTFND54RpFYpaOSxb1fHOSimksrYzjUgWUUrnTNth7YjX4ASq0srKCCvW7O7Uuw_T95Ie0zsfDQ1DO9K0RI1VI3hVqxKSFU9WE6YYAzm9D2lGOGgEfUSke50Q6SMiDagTopR5OGUobfjxFHQ0Pu0j6wOZWdvJ_5P-BZCJbvw</recordid><startdate>20160601</startdate><enddate>20160601</enddate><creator>Jin, Cheqing</creator><creator>Lall, Ashwin</creator><creator>(Jim) Xu, Jun</creator><creator>Zhang, Zhigang</creator><creator>Zhou, Aoying</creator><general>Elsevier Inc</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20160601</creationdate><title>Distributed error estimation of functional dependency</title><author>Jin, Cheqing ; Lall, Ashwin ; (Jim) Xu, Jun ; Zhang, Zhigang ; Zhou, 
Aoying</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c325t-9c980bf3ca61386f14a6b2bd55356ddbc939c9345ffcab18fe280f3164d57c3d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithms</topic><topic>Cost engineering</topic><topic>Data structures</topic><topic>Distributed processing</topic><topic>Effectiveness</topic><topic>Error analysis</topic><topic>Error estimation</topic><topic>Errors</topic><topic>Functional dependency</topic><topic>Scanning</topic><topic>Semantics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Jin, Cheqing</creatorcontrib><creatorcontrib>Lall, Ashwin</creatorcontrib><creatorcontrib>(Jim) Xu, Jun</creatorcontrib><creatorcontrib>Zhang, Zhigang</creatorcontrib><creatorcontrib>Zhou, Aoying</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Information sciences</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Jin, Cheqing</au><au>Lall, Ashwin</au><au>(Jim) Xu, Jun</au><au>Zhang, Zhigang</au><au>Zhou, Aoying</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Distributed error estimation of functional dependency</atitle><jtitle>Information sciences</jtitle><date>2016-06-01</date><risdate>2016</risdate><volume>345</volume><spage>156</spage><epage>176</epage><pages>156-176</pages><issn>0020-0255</issn><eissn>1872-6291</eissn><abstract>Measuring or estimating the number of 
errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries—that are dual to each other in semantics—for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.</abstract><pub>Elsevier Inc</pub><doi>10.1016/j.ins.2016.01.051</doi><tpages>21</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0020-0255
ispartof Information sciences, 2016-06, Vol.345, p.156-176
issn 0020-0255
1872-6291
language eng
recordid cdi_proquest_miscellaneous_1793278640
source ScienceDirect Journals (5 years ago - present)
subjects Algorithms
Cost engineering
Data structures
Distributed processing
Effectiveness
Error analysis
Error estimation
Errors
Functional dependency
Scanning
Semantics
title Distributed error estimation of functional dependency
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T18%3A29%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Distributed%20error%20estimation%20of%20functional%20dependency&rft.jtitle=Information%20sciences&rft.au=Jin,%20Cheqing&rft.date=2016-06-01&rft.volume=345&rft.spage=156&rft.epage=176&rft.pages=156-176&rft.issn=0020-0255&rft.eissn=1872-6291&rft_id=info:doi/10.1016/j.ins.2016.01.051&rft_dat=%3Cproquest_cross%3E1793278640%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1793278640&rft_id=info:pmid/&rft_els_id=S0020025516000852&rfr_iscdi=true