Generating synthetic data from marginal fitting for testing the efficacy of data-mining tools

Bibliographic details
Published in: International journal of production research, 2006-07, Vol. 44 (14), p. 2711-2730
Main authors: Jeske, D. R.; Gokhale, D. V.; Ye, L.
Format: Article
Language: English
Description
Summary: Testing data-mining tools during their development, or comparing the accuracy of alternative tools designed to achieve the same goal, requires having instances of the data sets on which the tools will operate. The diverse and massive nature of input data sets limits the practicality of obtaining test data sets through statistical sampling, and accessibility to public data sets is often limited as a result of proprietary and privacy rights that protect many sources of data. A natural alternative to obtaining actual data sets is to generate synthetic data sets. Usually partial information about associations between attributes on the data set will be available. This paper addresses the problem of how to integrate all the partial information into a non-parametric synthetic data generation scheme. The goal is to devise a scheme that incorporates all the information that can be found about associations between attributes, but not to force additional structure (e.g. distribution assumptions) into the scheme. The key to the scheme is a classic algorithm from statistics, the iterative proportional-fitting algorithm, which is well known for facilitating the analysis of contingency table data. In this paper, we show how it can be used to achieve our stated goal.
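The summary names iterative proportional fitting but gives no details of the scheme itself. As a rough illustration only, the classic two-way version of the algorithm can be sketched as follows. This is not the authors' method: the function names `ipf_2d` and `sample_records`, the uniform seed table, and the target marginals are all hypothetical choices made for this example, and the seed is assumed to have strictly positive entries.

```python
import random


def ipf_2d(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale a positive 2-way table until its margins match the targets.

    Hypothetical helper for illustration; assumes every seed cell is > 0
    and that the row and column targets have the same total.
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
        # Columns now match up to rounding, so only the rows need checking.
        gap = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if gap < tol:
            break
    return table


def sample_records(table, n, rng):
    """Draw n synthetic (row_index, col_index) records with cell weights."""
    cells = [(i, j) for i, row in enumerate(table) for j in range(len(row))]
    weights = [table[i][j] for i, j in cells]
    return rng.choices(cells, weights=weights, k=n)


# Assumed inputs: a uniform seed (no prior association between the two
# attributes) and made-up marginal distributions for each attribute.
seed = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
fitted = ipf_2d(seed, row_targets=[0.4, 0.6], col_targets=[0.2, 0.3, 0.5])
records = sample_records(fitted, n=1000, rng=random.Random(0))
```

With a uniform seed the fitted table reduces to the product of the two marginals. A seed that encodes known associations would keep its odds ratios while still matching the marginals, which is what makes the algorithm a natural fit for combining partial information without adding distributional assumptions.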
ISSN: 0020-7543, 1366-588X
DOI: 10.1080/00207540600622514