Is cross-validation valid for small-sample microarray classification?
Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable unders...
Published in: | Bioinformatics 2004-02, Vol.20 (3), p.374-380 |
---|---|
Main authors: | Braga-Neto, Ulisses M.; Dougherty, Edward R. |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 380 |
---|---|
container_issue | 3 |
container_start_page | 374 |
container_title | Bioinformatics |
container_volume | 20 |
creator | Braga-Neto, Ulisses M.; Dougherty, Edward R. |
description | Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules—linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)—using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples. |
doi_str_mv | 10.1093/bioinformatics/btg419 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_80157118</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>550192161</sourcerecordid><originalsourceid>FETCH-LOGICAL-c544t-c76b8d128c214398c6b257b64465d25194eed6c714b00538596112f3c21320ca3</originalsourceid><addsrcrecordid>eNqF0UFrFTEQB_AgFlurH0FZBL1tm0kmye5JpNS2UBGh0tJLyGazkprdfc3sE_vtje89LHrxlEB-8yczw9gr4EfAW3ncxTlOw5xHt0RPx93yDaF9wg4ANa8FV-3Tcpfa1Nhwuc-eE91xrgARn7F9wFZz1HjATi-o8nkmqn-4FPsSNk_V5lqV8IpGl1JNblylUI2xSJeze6h8ckRxiH5T8P4F2xtcovBydx6yrx9Pr07O68vPZxcnHy5rrxCX2hvdNT2IxgtA2TZed0KZTiNq1QsFLYbQa28Au_JX2ahWA4hBFi4F904esnfb3FWe79eBFjtG8iElN4V5TbbhoAxA818IrVBolC7wzT_wbl7nqTRRTKO1FlIWpLZoM6kcBrvKcXT5wQK3v7dh_96G3W6j1L3eha-7MfSPVbvxF_B2Bxx5l4bsJh_p0SklNG7aqbcu0hJ-_nl3-bvVRhplz29urflydX2mxCd7I38BRbKlzQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>198666233</pqid></control><display><type>article</type><title>Is cross-validation valid for small-sample microarray classification?</title><source>MEDLINE</source><source>Oxford Journals Open Access Collection</source><source>EZB-FREE-00999 freely available EZB journals</source><source>Alma/SFX Local Collection</source><creator>Braga-Neto, Ulisses M. ; Dougherty, Edward R.</creator><creatorcontrib>Braga-Neto, Ulisses M. ; Dougherty, Edward R.</creatorcontrib><description>Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. 
Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules—linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)—using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.</description><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1460-2059</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btg419</identifier><identifier>PMID: 14960464</identifier><identifier>CODEN: BOINFP</identifier><language>eng</language><publisher>Oxford: Oxford University Press</publisher><subject>Algorithms ; Benchmarking - methods ; Biological and medical sciences ; Breast Neoplasms - diagnosis ; Breast Neoplasms - genetics ; Computer Simulation ; Fundamental and applied biological sciences. 
Psychology ; Gene Expression Profiling - methods ; General aspects ; Genetic Predisposition to Disease - genetics ; Genetic Testing - methods ; Humans ; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects) ; Models, Genetic ; Models, Statistical ; Oligonucleotide Array Sequence Analysis - methods ; Pattern Recognition, Automated ; Reproducibility of Results ; Sample Size ; Sensitivity and Specificity</subject><ispartof>Bioinformatics, 2004-02, Vol.20 (3), p.374-380</ispartof><rights>2004 INIST-CNRS</rights><rights>Copyright Oxford University Press(England) Feb 12, 2004</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c544t-c76b8d128c214398c6b257b64465d25194eed6c714b00538596112f3c21320ca3</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=15526418$$DView record in Pascal Francis$$Hfree_for_read</backlink><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/14960464$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Braga-Neto, Ulisses M.</creatorcontrib><creatorcontrib>Dougherty, Edward R.</creatorcontrib><title>Is cross-validation valid for small-sample microarray classification?</title><title>Bioinformatics</title><addtitle>Bioinformatics</addtitle><description>Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. 
Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules—linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)—using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.</description><subject>Algorithms</subject><subject>Benchmarking - methods</subject><subject>Biological and medical sciences</subject><subject>Breast Neoplasms - diagnosis</subject><subject>Breast Neoplasms - genetics</subject><subject>Computer Simulation</subject><subject>Fundamental and applied biological sciences. 
Psychology</subject><subject>Gene Expression Profiling - methods</subject><subject>General aspects</subject><subject>Genetic Predisposition to Disease - genetics</subject><subject>Genetic Testing - methods</subject><subject>Humans</subject><subject>Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects)</subject><subject>Models, Genetic</subject><subject>Models, Statistical</subject><subject>Oligonucleotide Array Sequence Analysis - methods</subject><subject>Pattern Recognition, Automated</subject><subject>Reproducibility of Results</subject><subject>Sample Size</subject><subject>Sensitivity and Specificity</subject><issn>1367-4803</issn><issn>1460-2059</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2004</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNqF0UFrFTEQB_AgFlurH0FZBL1tm0kmye5JpNS2UBGh0tJLyGazkprdfc3sE_vtje89LHrxlEB-8yczw9gr4EfAW3ncxTlOw5xHt0RPx93yDaF9wg4ANa8FV-3Tcpfa1Nhwuc-eE91xrgARn7F9wFZz1HjATi-o8nkmqn-4FPsSNk_V5lqV8IpGl1JNblylUI2xSJeze6h8ckRxiH5T8P4F2xtcovBydx6yrx9Pr07O68vPZxcnHy5rrxCX2hvdNT2IxgtA2TZed0KZTiNq1QsFLYbQa28Au_JX2ahWA4hBFi4F904esnfb3FWe79eBFjtG8iElN4V5TbbhoAxA818IrVBolC7wzT_wbl7nqTRRTKO1FlIWpLZoM6kcBrvKcXT5wQK3v7dh_96G3W6j1L3eha-7MfSPVbvxF_B2Bxx5l4bsJh_p0SklNG7aqbcu0hJ-_nl3-bvVRhplz29urflydX2mxCd7I38BRbKlzQ</recordid><startdate>20040212</startdate><enddate>20040212</enddate><creator>Braga-Neto, Ulisses M.</creator><creator>Dougherty, Edward R.</creator><general>Oxford University Press</general><general>Oxford Publishing Limited 
(England)</general><scope>BSCLL</scope><scope>IQODW</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7TM</scope><scope>7TO</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>H8G</scope><scope>H94</scope><scope>JG9</scope><scope>JQ2</scope><scope>K9.</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope></search><sort><creationdate>20040212</creationdate><title>Is cross-validation valid for small-sample microarray classification?</title><author>Braga-Neto, Ulisses M. ; Dougherty, Edward R.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c544t-c76b8d128c214398c6b257b64465d25194eed6c714b00538596112f3c21320ca3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2004</creationdate><topic>Algorithms</topic><topic>Benchmarking - methods</topic><topic>Biological and medical sciences</topic><topic>Breast Neoplasms - diagnosis</topic><topic>Breast Neoplasms - genetics</topic><topic>Computer Simulation</topic><topic>Fundamental and applied biological sciences. Psychology</topic><topic>Gene Expression Profiling - methods</topic><topic>General aspects</topic><topic>Genetic Predisposition to Disease - genetics</topic><topic>Genetic Testing - methods</topic><topic>Humans</topic><topic>Mathematics in biology. Statistical analysis. Models. Metrology. 
Data processing in biology (general aspects)</topic><topic>Models, Genetic</topic><topic>Models, Statistical</topic><topic>Oligonucleotide Array Sequence Analysis - methods</topic><topic>Pattern Recognition, Automated</topic><topic>Reproducibility of Results</topic><topic>Sample Size</topic><topic>Sensitivity and Specificity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Braga-Neto, Ulisses M.</creatorcontrib><creatorcontrib>Dougherty, Edward R.</creatorcontrib><collection>Istex</collection><collection>Pascal-Francis</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Oncogenes and Growth Factors Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Copper Technical Reference Library</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science 
Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Braga-Neto, Ulisses M.</au><au>Dougherty, Edward R.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Is cross-validation valid for small-sample microarray classification?</atitle><jtitle>Bioinformatics</jtitle><addtitle>Bioinformatics</addtitle><date>2004-02-12</date><risdate>2004</risdate><volume>20</volume><issue>3</issue><spage>374</spage><epage>380</epage><pages>374-380</pages><issn>1367-4803</issn><eissn>1460-2059</eissn><eissn>1367-4811</eissn><coden>BOINFP</coden><abstract>Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules—linear discriminant analysis, 3-nearest-neighbor and decision trees (CART)—using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. 
Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.</abstract><cop>Oxford</cop><pub>Oxford University Press</pub><pmid>14960464</pmid><doi>10.1093/bioinformatics/btg419</doi><tpages>7</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1367-4803 |
ispartof | Bioinformatics, 2004-02, Vol.20 (3), p.374-380 |
issn | 1367-4803; 1460-2059; 1367-4811 |
language | eng |
recordid | cdi_proquest_miscellaneous_80157118 |
source | MEDLINE; Oxford Journals Open Access Collection; EZB-FREE-00999 freely available EZB journals; Alma/SFX Local Collection |
subjects | Algorithms; Benchmarking - methods; Biological and medical sciences; Breast Neoplasms - diagnosis; Breast Neoplasms - genetics; Computer Simulation; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling - methods; General aspects; Genetic Predisposition to Disease - genetics; Genetic Testing - methods; Humans; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis - methods; Pattern Recognition, Automated; Reproducibility of Results; Sample Size; Sensitivity and Specificity |
title | Is cross-validation valid for small-sample microarray classification? |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T17%3A57%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Is%20cross-validation%20valid%20for%20small-sample%20microarray%20classification?&rft.jtitle=Bioinformatics&rft.au=Braga-Neto,%20Ulisses%20M.&rft.date=2004-02-12&rft.volume=20&rft.issue=3&rft.spage=374&rft.epage=380&rft.pages=374-380&rft.issn=1367-4803&rft.eissn=1460-2059&rft.coden=BOINFP&rft_id=info:doi/10.1093/bioinformatics/btg419&rft_dat=%3Cproquest_cross%3E550192161%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=198666233&rft_id=info:pmid/14960464&rfr_iscdi=true |
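
The comparison described in the abstract (the distribution of estimated minus true error, summarized by bias, variance and root-mean-square error) can be illustrated with a small simulation. The sketch below is not the authors' code and does not reproduce their setup: it assumes a one-dimensional, two-class Gaussian model with unit variance, uses a nearest-mean threshold rule as a stand-in for linear discriminant analysis, and picks the .632 bootstrap as one possible bootstrap variant. All function names and parameter values (`n_per_class`, `delta`, `trials`, `reps`) are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit(data):
    """Nearest-mean rule on 1-D points; data is a list of (x, label) pairs."""
    x0 = [x for x, y in data if y == 0]
    x1 = [x for x, y in data if y == 1]
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    # Threshold halfway between the class means; sign records which side is class 1.
    return (m0 + m1) / 2.0, (1.0 if m1 >= m0 else -1.0)


def predict(model, x):
    thr, sign = model
    return 1 if sign * (x - thr) > 0 else 0


def error_rate(model, data):
    return sum(predict(model, x) != y for x, y in data) / len(data)


def true_error(model, delta):
    """Exact error under the assumed model: class 0 ~ N(0,1), class 1 ~ N(delta,1)."""
    thr, sign = model
    e0 = 1.0 - phi(thr)      # class 0 points falling on the class-1 side
    e1 = phi(thr - delta)    # class 1 points falling on the class-0 side
    if sign < 0:             # decision regions are flipped
        e0, e1 = 1.0 - e0, 1.0 - e1
    return 0.5 * (e0 + e1)


def loo(data):
    """Leave-one-out cross-validation error estimate."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        wrong += predict(fit(data[:i] + data[i + 1:]), x) != y
    return wrong / len(data)


def boot0(data, rng, reps):
    """Out-of-bag (zero) bootstrap error, averaged over reps usable resamples."""
    n, errs = len(data), []
    while len(errs) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        train = [data[i] for i in idx]
        test = [data[i] for i in range(n) if i not in chosen]
        # Skip resamples that lack a class or leave no out-of-bag points.
        if test and len({y for _, y in train}) == 2:
            errs.append(error_rate(fit(train), test))
    return statistics.fmean(errs)


def simulate(n_per_class=10, delta=1.5, trials=200, reps=25, seed=0):
    """Return bias, variance and RMSE of (estimate - true error) per estimator."""
    rng = random.Random(seed)
    dev = {"resub": [], "loo": [], "b632": []}
    for _ in range(trials):
        data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
        data += [(rng.gauss(delta, 1.0), 1) for _ in range(n_per_class)]
        model = fit(data)
        truth = true_error(model, delta)
        resub = error_rate(model, data)
        dev["resub"].append(resub - truth)
        dev["loo"].append(loo(data) - truth)
        dev["b632"].append(0.368 * resub + 0.632 * boot0(data, rng, reps) - truth)
    return {
        name: {
            "bias": statistics.fmean(d),
            "variance": statistics.pvariance(d),
            "rmse": math.sqrt(statistics.fmean([v * v for v in d])),
        }
        for name, d in dev.items()
    }


if __name__ == "__main__":
    for name, s in simulate().items():
        print(f"{name:6s} bias={s['bias']:+.3f} "
              f"var={s['variance']:.4f} rmse={s['rmse']:.3f}")
```

Because the variance here is the population variance of the deviations, each estimator's squared RMSE equals its squared bias plus its variance, which is the "composition of bias and variance" the abstract refers to. The toy model is far simpler than the synthetic and breast-cancer data used in the study, so any numbers it prints illustrate the method only and are not a replication of the reported results.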