Is cross-validation valid for small-sample microarray classification?

Bibliographic details
Published in:	Bioinformatics, 2004-02, Vol. 20 (3), pp. 374-380
Main authors:	Braga-Neto, Ulisses M.; Dougherty, Edward R.
Format:	Article
Language:	English
Keywords:	Algorithms; Benchmarking - methods; Biological and medical sciences; Breast Neoplasms - diagnosis; Breast Neoplasms - genetics; Computer Simulation; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling - methods; General aspects; Genetic Predisposition to Disease - genetics; Genetic Testing - methods; Humans; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis - methods; Pattern Recognition, Automated; Reproducibility of Results; Sample Size; Sensitivity and Specificity
Online access:	Full text

Description
Abstract:	Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules (linear discriminant analysis, 3-nearest-neighbor and CART decision trees), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for the combination of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.
ISSN:	1367-4803 1460-2059 1367-4811
DOI:	10.1093/bioinformatics/btg419
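
The comparison described in the abstract (the distribution of estimated minus true error, summarized by bias, variance and root-mean-square error) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code. Everything specific in it is an assumption made for illustration: the two-class spherical Gaussian model, the sample size of 20, 5-fold cross-validation, a plain leave-out bootstrap standing in for the bootstrap methods the abstract mentions, and the true error approximated on a large independent test set. All function names are hypothetical.

```python
import numpy as np


def sample(rng, n, d=2, delta=1.0):
    """Draw n points from two equally likely spherical Gaussian classes (assumed model)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + delta * y[:, None]
    return x, y


def knn3_predict(x_train, y_train, x_test):
    """3-nearest-neighbor majority vote using squared Euclidean distance."""
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :3]
    return (y_train[nearest].sum(axis=1) >= 2).astype(int)


def error(x_train, y_train, x_test, y_test):
    """Fraction of test points misclassified by the 3NN rule designed on the training data."""
    return float(np.mean(knn3_predict(x_train, y_train, x_test) != y_test))


def resubstitution(x, y):
    """Error measured on the same points used to design the classifier."""
    return error(x, y, x, y)


def cross_validation(x, y, rng, k=5):
    """k-fold cross-validation: every point is tested once by a classifier that never saw it."""
    idx = rng.permutation(len(y))
    wrong = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        wrong += np.sum(knn3_predict(x[train], y[train], x[fold]) != y[fold])
    return float(wrong / len(y))


def bootstrap_leave_out(x, y, rng, b=25):
    """Leave-out bootstrap: design on each resample, test on the points it omitted."""
    n = len(y)
    wrong = total = 0
    for _ in range(b):
        boot = rng.integers(0, n, size=n)
        out = np.setdiff1d(np.arange(n), boot)
        if out.size == 0:
            continue
        wrong += np.sum(knn3_predict(x[boot], y[boot], x[out]) != y[out])
        total += out.size
    return float(wrong / total)


def summarize(dev):
    """Bias, variance and RMSE of the deviations (estimated error minus true error)."""
    return {
        "bias": float(dev.mean()),
        "variance": float(dev.var()),
        "rmse": float(np.sqrt(np.mean(dev ** 2))),
    }


def simulate(n=20, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    # A large independent sample approximates the true error of each designed classifier.
    x_big, y_big = sample(rng, 2000)
    dev = {"resubstitution": [], "cross-validation": [], "bootstrap": []}
    for _ in range(reps):
        x, y = sample(rng, n)
        true_err = error(x, y, x_big, y_big)
        dev["resubstitution"].append(resubstitution(x, y) - true_err)
        dev["cross-validation"].append(cross_validation(x, y, rng) - true_err)
        dev["bootstrap"].append(bootstrap_leave_out(x, y, rng) - true_err)
    return {name: summarize(np.array(values)) for name, values in dev.items()}


if __name__ == "__main__":
    for name, stats in simulate().items():
        print(f"{name:17s} bias={stats['bias']:+.3f} "
              f"variance={stats['variance']:.4f} rmse={stats['rmse']:.3f}")
```

Running the script prints one line per estimator. The numbers depend on the assumed model and seed, so they should be read as an illustration of how the deviation statistics are computed rather than as a reproduction of the published results.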