Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation

With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample compo...

Bibliographic details
Published in: PloS one 2014-06, Vol.9 (6), p.e100335-e100335
Main authors: Soneson, Charlotte; Gerster, Sarah; Delorenzi, Mauro
Format: Article
Language: English
Online access: Full text
container_end_page e100335
container_issue 6
container_start_page e100335
container_title PloS one
container_volume 9
creator Soneson, Charlotte
Gerster, Sarah
Delorenzi, Mauro
description With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
doi_str_mv 10.1371/journal.pone.0100335
format Article
contributor Zhang, Shu-Dong
date 2014-06-26
pmid 24967636
publisher United States: Public Library of Science
rights 2014 Soneson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
linktohtml https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072626/
fulltext fulltext
identifier ISSN: 1932-6203
ispartof PloS one, 2014-06, Vol.9 (6), p.e100335-e100335
issn 1932-6203
1932-6203
language eng
recordid cdi_plos_journals_1540754993
source MEDLINE; DOAJ Directory of Open Access Journals; Public Library of Science (PLoS) Journals Open Access; EZB-FREE-00999 freely available EZB journals; PubMed Central; Free Full-Text Journals in Chemistry
subjects Analysis
Artificial Intelligence
Bias
Bioinformatics
Biology and Life Sciences
Classification
Classifiers
Colorectal cancer
Composition effects
Computational Biology
Datasets
Experiments
Gene expression
Gene Expression Profiling
Medical research
Performance evaluation
Performance prediction
Physical Sciences
Regression analysis
Reproducibility of Results
Research and Analysis Methods
Researchers
Science Policy
Statistics as Topic - methods
Studies
Support vector machines
Variables
title Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T13%3A34%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Batch%20effect%20confounding%20leads%20to%20strong%20bias%20in%20performance%20estimates%20obtained%20by%20cross-validation&rft.jtitle=PloS%20one&rft.au=Soneson,%20Charlotte&rft.date=2014-06-26&rft.volume=9&rft.issue=6&rft.spage=e100335&rft.epage=e100335&rft.pages=e100335-e100335&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0100335&rft_dat=%3Cgale_plos_%3EA417793084%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1540754993&rft_id=info:pmid/24967636&rft_galeid=A417793084&rft_doaj_id=oai_doaj_org_article_a6810f9ba64e4926a1f7708910723668&rfr_iscdi=true
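
The effect summarized in the description field can be illustrated with a small simulation. The sketch below is not the authors' code or pipeline. It rests on several simplifying assumptions: Gaussian noise instead of real gene expression data, no true difference between the groups, a nearest-centroid classifier in place of logistic regression, SVM, kNN or Random Forests, plain k-fold cross-validation instead of the nested scheme with feature selection, and independent data drawn from a single new batch without any shift. All function names and parameter values (simulate, cv_accuracy, run, batch_shift and so on) are hypothetical. It compares the cross-validated accuracy with the accuracy on independent data when group and batch are fully confounded and when each group is spread evenly over both batches.

```python
import random


def simulate(n_per_group, n_features, batch_shift, confounded, rng):
    """Return (X, y) for two groups that have NO true group difference.

    Batch B adds `batch_shift` to every feature (hypothetical batch effect).
    confounded=True : group 0 is entirely batch A, group 1 entirely batch B.
    confounded=False: each group is split evenly across both batches.
    """
    X, y = [], []
    for group in (0, 1):
        for i in range(n_per_group):
            in_batch_b = (group == 1) if confounded else (i % 2 == 1)
            shift = batch_shift if in_batch_b else 0.0
            X.append([rng.gauss(0.0, 1.0) + shift for _ in range(n_features)])
            y.append(group)
    return X, y


def centroids(X, y):
    """Per-class mean vector (the 'training' step of a nearest-centroid rule)."""
    out = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        out[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return out


def predict(cents, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda label: sqdist(cents[label]))


def accuracy(cents, X, y):
    return sum(predict(cents, x) == lab for x, lab in zip(X, y)) / len(y)


def cv_accuracy(X, y, k, rng):
    """Plain k-fold cross-validation on randomly shuffled samples."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    correct = 0
    for f in range(k):
        test = set(idx[f::k])
        train = [i for i in idx if i not in test]
        cents = centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(cents, X[i]) == y[i] for i in test)
    return correct / len(y)


def run(confounded, seed=0):
    """Return (cross-validated accuracy, accuracy on independent data)."""
    rng = random.Random(seed)
    X, y = simulate(50, 20, 5.0, confounded, rng)
    cv = cv_accuracy(X, y, 5, rng)
    # Assumed independent data: one new batch without shift, same null groups.
    X_new, y_new = simulate(100, 20, 0.0, False, rng)
    ext = accuracy(centroids(X, y), X_new, y_new)
    return cv, ext


if __name__ == "__main__":
    for confounded in (True, False):
        cv, ext = run(confounded)
        print(f"confounded={confounded}: CV={cv:.2f}, independent={ext:.2f}")
```

Under these assumptions the classifier in the confounded setting can only learn the batch shift, so the cross-validation estimate looks far better than the performance on independent data, whereas in the balanced setting the two figures stay much closer. Grouping folds by batch, or validating on a batch that was never used for training, is one common way to make such a gap visible before independent data become available.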