Distributed error estimation of functional dependency

Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to o...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information sciences 2016-06, Vol.345, p.156-176
Hauptverfasser: Jin, Cheqing, Lall, Ashwin, (Jim) Xu, Jun, Zhang, Zhigang, Zhou, Aoying
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries—that are dual to each other in semantics—for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.
ISSN:0020-0255
1872-6291
DOI:10.1016/j.ins.2016.01.051