Randomization methods for assessing data analysis results on real-valued matrices

Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is the...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Statistical analysis and data mining 2009-11, Vol.2 (4), p.209-230
Hauptverfasser: Ojala, Markus, Vuokko, Niko, Kallio, Aleksi, Haiminen, Niina, Mannila, Heikki
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that have the same row and column distributions of values as the original dataset. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column statistics, or due to some more interesting phenomena in the data. We study the problem of generating such randomized datasets. We describe methods based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. We also show how our methods can be applied to a real data analysis scenario on DNA microarray data. The results indicate that the methods work efficiently and are usable in significance testing of data mining results on real‐valued matrices. Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company
ISSN:1932-1864
1932-1872
DOI:10.1002/sam.10042