Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models

Prediction models are used in clinical research to develop rules that can be used to accurately predict the outcome of the patients based on some of their characteristics. They represent a valuable tool in the decision making process of clinicians and health policy makers, as they enable them to est...

Detailed description

Bibliographic details
Published in: BMC bioinformatics 2015-11, Vol.16 (358), p.363-363, Article 363
Main authors: Blagus, Rok; Lusa, Lara
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 363
container_issue 358
container_start_page 363
container_title BMC bioinformatics
container_volume 16
creator Blagus, Rok
Lusa, Lara
description Prediction models are used in clinical research to develop rules that can be used to accurately predict the outcome of patients based on some of their characteristics. They represent a valuable tool in the decision-making process of clinicians and health policy makers, as they enable them to estimate the probability that patients have or will develop a disease, will respond to a treatment, or that their disease will recur. The interest devoted to prediction models in the biomedical community has been growing in the last few years. Often the data used to develop the prediction models are class-imbalanced, as only a few patients experience the event (and therefore belong to the minority class). Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include under- and oversampling, where a fraction of the majority class samples are retained in the analysis or new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided. We identify some results from the biomedical literature where incorrect cross-validation was performed and where we expect that the performance of oversampling techniques was heavily overestimated.
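The risk named in the abstract, overestimating accuracy when oversampling is combined with cross-validation, can be illustrated with a small sketch. This is not the authors' code or their simulation design: the 1-nearest-neighbour classifier, the random oversampling by duplication, the pure-noise data, and all function names are assumptions chosen only to make the effect visible with the Python standard library.

```python
import random


def make_noise_data(n_major, n_minor, n_features, rng):
    """Rows of (features, label); features are pure noise, so no real signal exists."""
    rows = []
    for label, n in ((0, n_major), (1, n_minor)):
        for _ in range(n):
            features = tuple(rng.gauss(0.0, 1.0) for _ in range(n_features))
            rows.append((features, label))
    return rows


def oversample(rows, rng):
    """Random oversampling: append duplicated minority rows until classes balance."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def predict_1nn(train, x):
    """Label of the training row nearest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))
    return nearest[1]


def cv_minority_recall(rows, k, rng, oversample_before_split):
    """Pooled k-fold estimate of the fraction of minority test rows predicted correctly."""
    data = oversample(rows, rng) if oversample_before_split else list(rows)
    rng.shuffle(data)
    hits = total = 0
    for fold in range(k):
        test = data[fold::k]
        train = [r for i, r in enumerate(data) if i % k != fold]
        if not oversample_before_split:
            train = oversample(train, rng)  # resample the training part only
        for x, y in test:
            if y == 1:
                total += 1
                hits += int(predict_1nn(train, x) == 1)
    return hits / total


rng = random.Random(0)
rows = make_noise_data(n_major=90, n_minor=10, n_features=5, rng=rng)
leaky_recall = cv_minority_recall(rows, 5, rng, oversample_before_split=True)
correct_recall = cv_minority_recall(rows, 5, rng, oversample_before_split=False)
print(f"oversampling before splitting: {leaky_recall:.2f}")
print(f"oversampling inside each fold: {correct_recall:.2f}")
```

Because the features carry no signal, any large gap between the two printed values comes from duplicated minority rows landing on both sides of the split, not from genuine predictive ability. Resampling only the training part of each fold keeps the held-out rows independent of the data the model was fitted on.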
doi_str_mv 10.1186/s12859-015-0784-9
format Article
fullrecord <record><control><sourceid>gale_proqu</sourceid><recordid>TN_cdi_proquest_miscellaneous_1731793342</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A447997202</galeid><sourcerecordid>A447997202</sourcerecordid><originalsourceid>FETCH-LOGICAL-c473t-fae555e5a43a020634585748a54ca9d805456ca243dc9ef646a9880b44356f843</originalsourceid><addsrcrecordid>eNptkk1vFSEUhidGY2v1B7gxk7jRxVQYYIBl01itaWLix5pQONzSzMAIzI0u-t9l7q0f1xgWHOB5XzjhbZrnGJ1iLIY3GfeCyQ5h1iEuaCcfNMeYctz1GLGHf9VHzZOcbxHCXCD2uDnqB0a46Plxc_ch-lDaJUMbXRu3kLpWB9suwdYy62kefdi0BcxN8N8WyLtTk2LO3VaP3uriY2hdTG25gdbCFsY4T1A9V1DnDDnvltV-TmC92QmmaGHMT5tHTo8Znt3PJ83Xi7dfzt93Vx_fXZ6fXXWGclI6p4ExBkxTolGPBkKZYJwKzajR0taeKBuM7imxRoIb6KClEOiaUsIGJyg5aV7tfecU1yaKmnw2MI46QFyywpxgLgmhfUVf_oPexiWF-rpKcTkIVm_7Q230CMoHF0vSZjVVZ5RyKXmPVq_T_1B1WJi8iQGcr_sHgtcHgsoU-F42eslZXX7-dMjiPbv7jAROzclPOv1QGKk1HmofD1XjodZ4KFk1L-6bW64nsL8Vv_JAfgJ-ZbNt</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1779685054</pqid></control><display><type>article</type><title>Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>SpringerNature Journals</source><source>PubMed Central Open Access</source><source>Springer Nature OA Free Journals</source><source>PubMed Central</source><creator>Blagus, Rok ; Lusa, Lara</creator><creatorcontrib>Blagus, Rok ; Lusa, Lara</creatorcontrib><description>Prediction models are used in clinical research to develop rules that can be used to accurately predict the outcome of the patients based on some of their characteristics. 
They represent a valuable tool in the decision making process of clinicians and health policy makers, as they enable them to estimate the probability that patients have or will develop a disease, will respond to a treatment, or that their disease will recur. The interest devoted to prediction models in the biomedical community has been growing in the last few years. Often the data used to develop the prediction models are class-imbalanced as only few patients experience the event (and therefore belong to minority class). Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include under- and oversampling, where a fraction of the majority class samples are retained in the analysis or new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided. 
We identify some results from the biomedical literature where the incorrect cross-validation was performed, where we expect that the performance of oversampling techniques was heavily overestimated.</description><identifier>ISSN: 1471-2105</identifier><identifier>EISSN: 1471-2105</identifier><identifier>DOI: 10.1186/s12859-015-0784-9</identifier><identifier>PMID: 26537827</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>Analysis ; Area Under Curve ; Databases as Topic ; Humans ; Models, Theoretical ; Prediction (Logic) ; Probability ; Reproducibility of Results ; Resampling (Statistics) ; Treatment outcome</subject><ispartof>BMC bioinformatics, 2015-11, Vol.16 (358), p.363-363, Article 363</ispartof><rights>COPYRIGHT 2015 BioMed Central Ltd.</rights><rights>Copyright BioMed Central 2015</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c473t-fae555e5a43a020634585748a54ca9d805456ca243dc9ef646a9880b44356f843</citedby><cites>FETCH-LOGICAL-c473t-fae555e5a43a020634585748a54ca9d805456ca243dc9ef646a9880b44356f843</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,864,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/26537827$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Blagus, Rok</creatorcontrib><creatorcontrib>Lusa, Lara</creatorcontrib><title>Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models</title><title>BMC bioinformatics</title><addtitle>BMC Bioinformatics</addtitle><description>Prediction models are used in clinical research to develop rules that can be used to accurately predict the outcome of the patients based on some of their 
characteristics. They represent a valuable tool in the decision making process of clinicians and health policy makers, as they enable them to estimate the probability that patients have or will develop a disease, will respond to a treatment, or that their disease will recur. The interest devoted to prediction models in the biomedical community has been growing in the last few years. Often the data used to develop the prediction models are class-imbalanced as only few patients experience the event (and therefore belong to minority class). Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include under- and oversampling, where a fraction of the majority class samples are retained in the analysis or new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided. 
We identify some results from the biomedical literature where the incorrect cross-validation was performed, where we expect that the performance of oversampling techniques was heavily overestimated.</description><subject>Analysis</subject><subject>Area Under Curve</subject><subject>Databases as Topic</subject><subject>Humans</subject><subject>Models, Theoretical</subject><subject>Prediction (Logic)</subject><subject>Probability</subject><subject>Reproducibility of Results</subject><subject>Resampling (Statistics)</subject><subject>Treatment outcome</subject><issn>1471-2105</issn><issn>1471-2105</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNptkk1vFSEUhidGY2v1B7gxk7jRxVQYYIBl01itaWLix5pQONzSzMAIzI0u-t9l7q0f1xgWHOB5XzjhbZrnGJ1iLIY3GfeCyQ5h1iEuaCcfNMeYctz1GLGHf9VHzZOcbxHCXCD2uDnqB0a46Plxc_ch-lDaJUMbXRu3kLpWB9suwdYy62kefdi0BcxN8N8WyLtTk2LO3VaP3uriY2hdTG25gdbCFsY4T1A9V1DnDDnvltV-TmC92QmmaGHMT5tHTo8Znt3PJ83Xi7dfzt93Vx_fXZ6fXXWGclI6p4ExBkxTolGPBkKZYJwKzajR0taeKBuM7imxRoIb6KClEOiaUsIGJyg5aV7tfecU1yaKmnw2MI46QFyywpxgLgmhfUVf_oPexiWF-rpKcTkIVm_7Q230CMoHF0vSZjVVZ5RyKXmPVq_T_1B1WJi8iQGcr_sHgtcHgsoU-F42eslZXX7-dMjiPbv7jAROzclPOv1QGKk1HmofD1XjodZ4KFk1L-6bW64nsL8Vv_JAfgJ-ZbNt</recordid><startdate>20151104</startdate><enddate>20151104</enddate><creator>Blagus, Rok</creator><creator>Lusa, Lara</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7SC</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>L7M</scope><scope>LK8</scope><scope>L~C</scope><scope>L~D</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>7X8</scope></search><sort><creationdate>20151104</creationdate><title>Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models</title><author>Blagus, Rok ; Lusa, Lara</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c473t-fae555e5a43a020634585748a54ca9d805456ca243dc9ef646a9880b44356f843</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Analysis</topic><topic>Area Under Curve</topic><topic>Databases as Topic</topic><topic>Humans</topic><topic>Models, Theoretical</topic><topic>Prediction (Logic)</topic><topic>Probability</topic><topic>Reproducibility of Results</topic><topic>Resampling (Statistics)</topic><topic>Treatment 
outcome</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Blagus, Rok</creatorcontrib><creatorcontrib>Lusa, Lara</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health 
Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>ProQuest Biological Science Collection</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Computing Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>MEDLINE - Academic</collection><jtitle>BMC bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Blagus, Rok</au><au>Lusa, Lara</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models</atitle><jtitle>BMC bioinformatics</jtitle><addtitle>BMC 
Bioinformatics</addtitle><date>2015-11-04</date><risdate>2015</risdate><volume>16</volume><issue>358</issue><spage>363</spage><epage>363</epage><pages>363-363</pages><artnum>363</artnum><issn>1471-2105</issn><eissn>1471-2105</eissn><abstract>Prediction models are used in clinical research to develop rules that can be used to accurately predict the outcome of the patients based on some of their characteristics. They represent a valuable tool in the decision making process of clinicians and health policy makers, as they enable them to estimate the probability that patients have or will develop a disease, will respond to a treatment, or that their disease will recur. The interest devoted to prediction models in the biomedical community has been growing in the last few years. Often the data used to develop the prediction models are class-imbalanced as only few patients experience the event (and therefore belong to minority class). Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include under- and oversampling, where a fraction of the majority class samples are retained in the analysis or new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. 
Examples based on the re-analysis of real datasets and simulation studies are provided. We identify some results from the biomedical literature where the incorrect cross-validation was performed, where we expect that the performance of oversampling techniques was heavily overestimated.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>26537827</pmid><doi>10.1186/s12859-015-0784-9</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1471-2105
ispartof BMC bioinformatics, 2015-11, Vol.16 (358), p.363-363, Article 363
issn 1471-2105
1471-2105
language eng
recordid cdi_proquest_miscellaneous_1731793342
source MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; SpringerNature Journals; PubMed Central Open Access; Springer Nature OA Free Journals; PubMed Central
subjects Analysis
Area Under Curve
Databases as Topic
Humans
Models, Theoretical
Prediction (Logic)
Probability
Reproducibility of Results
Resampling (Statistics)
Treatment outcome
title Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T01%3A34%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Joint%20use%20of%20over-%20and%20under-sampling%20techniques%20and%20cross-validation%20for%20the%20development%20and%20assessment%20of%20prediction%20models&rft.jtitle=BMC%20bioinformatics&rft.au=Blagus,%20Rok&rft.date=2015-11-04&rft.volume=16&rft.issue=358&rft.spage=363&rft.epage=363&rft.pages=363-363&rft.artnum=363&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/s12859-015-0784-9&rft_dat=%3Cgale_proqu%3EA447997202%3C/gale_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1779685054&rft_id=info:pmid/26537827&rft_galeid=A447997202&rfr_iscdi=true