Missing value estimation methods for DNA microarrays

Bibliographic details
Published in:	Bioinformatics 2001-06, Vol.17 (6), p.520-525
Main authors:	Troyanskaya, Olga; Cantor, Michael; Sherlock, Gavin; Brown, Pat; Hastie, Trevor; Tibshirani, Robert; Botstein, David; Altman, Russ B.
Format:	Article
Language:	English
Subjects:	Algorithms; Biological and medical sciences; Cell Cycle - genetics; Cluster Analysis; Data Display; Data Interpretation, Statistical; DNA microarrays; Fundamental and applied biological sciences. Psychology; Gene Expression; General aspects; Mathematical Computing; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Multigene Family; Oligonucleotide Array Sequence Analysis - statistics & numerical data; Saccharomyces cerevisiae - genetics; Sensitivity and Specificity; Software

Description
Abstract:	Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions. Availability: The software is available at http://smi-web.stanford.edu/projects/helix/pubs/impute/ Contact: russ.altman@stanford.edu
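
Illustrative sketch (not part of the catalog record or the authors' software): the abstract names row average and weighted K-nearest neighbors as two of the compared methods but does not give their details. The Python below is a minimal sketch of both ideas under stated assumptions: missing values are encoded as None, rows are genes and columns are arrays, distance is root-mean-square difference over the columns both rows have observed, and neighbors are weighted by inverse distance. The function names row_average_impute and knn_impute, the parameter k, and the weighting scheme are hypothetical choices made for illustration, and may differ from what KNNimpute actually does.

```python
import math


def row_average_impute(matrix):
    """Fill each missing entry (None) with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values falls back to 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def knn_impute(matrix, k=2):
    """Fill each missing entry from the k most similar rows that observe that column.

    Assumptions (not taken from the abstract): similarity is root-mean-square
    difference over shared observed columns; weights are inverse distances.
    Entries with no usable neighbor are left as None.
    """
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                # A neighbor must be a different row with column j observed.
                if other_index == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Small epsilon avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = weighted_sum / sum(weights)
    return filled


if __name__ == "__main__":
    expression = [
        [1.0, 2.0, 3.0],
        [1.0, 2.0, None],   # one missing expression value
        [5.0, 6.0, 7.0],
    ]
    print(row_average_impute(expression))
    print(knn_impute(expression, k=1))
```

In this toy matrix the second row matches the first on its two observed columns, so a neighbor-based estimate draws on the first row's third value, whereas the row average uses only the second row's own observed values. This is only meant to show the difference in the kind of information each approach uses; it says nothing about the accuracy figures reported in the article.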
ISSN:	1367-4803, 1460-2059, 1367-4811
DOI:	10.1093/bioinformatics/17.6.520