Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
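The authors' actual methods are implemented in the R package sambia; as a rough, hypothetical sketch of the plain inverse-probability bagging idea the abstract builds on, one can draw bootstrap samples with probabilities proportional to the inverse of each observation's known inclusion probability, train a base learner on each draw, and average the predictions. The function names and the nearest-centroid base learner below are illustrative assumptions, not the paper's implementation (which centers on the random forest and a parametric variant):

```python
import numpy as np

def _nearest_centroid_predict(X_tr, y_tr, X_new):
    """Minimal stand-in base learner: assign the nearer class centroid."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_new - c0) ** 2).sum(axis=1)
    d1 = ((X_new - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(float)

def ip_bagging_predict(X, y, incl_prob, X_new, n_boot=50, seed=0):
    """Inverse-probability bagging sketch: bootstrap with sampling
    weights 1/incl_prob, so strata oversampled by the study design
    are down-weighted back toward the population distribution, then
    average the base learners' class-1 votes."""
    rng = np.random.default_rng(seed)
    w = 1.0 / np.asarray(incl_prob, dtype=float)
    p = w / w.sum()                      # resampling distribution
    votes = np.zeros((n_boot, len(X_new)))
    for b in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True, p=p)
        votes[b] = _nearest_centroid_predict(X[idx], y[idx], X_new)
    return votes.mean(axis=0)
```

In a two-phase case-control setting the inclusion probabilities are known from the sampling design (e.g. cases sampled at a far higher rate than controls), which is what makes this correction feasible without modeling the selection mechanism.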

Bibliographic Details
Published in: Computational and mathematical methods in medicine, 2017-01, Vol. 2017 (2017), p. 1-18
Main authors: Krautenbacher, Norbert; Fuchs, Christiane; Theis, Fabian J.
Format: Article
Language: English
Online access: Full text
container_end_page 18
container_issue 2017
container_start_page 1
container_title Computational and mathematical methods in medicine
container_volume 2017
creator Krautenbacher, Norbert
Fuchs, Christiane
Theis, Fabian J.
description Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
doi_str_mv 10.1155/2017/7847531
format Article
publisher Hindawi Publishing Corporation, Cairo, Egypt
pmid 29312464
contributor Schmid, Matthias
fulltext fulltext
identifier ISSN: 1748-670X
ispartof Computational and mathematical methods in medicine, 2017-01, Vol.2017 (2017), p.1-18
issn 1748-670X
1748-6718
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_5632994
source PubMed Central Open Access; Wiley Online Library Open Access; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
title Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T22%3A59%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Correcting%20Classifiers%20for%20Sample%20Selection%20Bias%20in%20Two-Phase%20Case-Control%20Studies&rft.jtitle=Computational%20and%20mathematical%20methods%20in%20medicine&rft.au=Krautenbacher,%20Norbert&rft.date=2017-01-01&rft.volume=2017&rft.issue=2017&rft.spage=1&rft.epage=18&rft.pages=1-18&rft.issn=1748-670X&rft.eissn=1748-6718&rft_id=info:doi/10.1155/2017/7847531&rft_dat=%3Cproquest_pubme%3E1989562199%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1989562199&rft_id=info:pmid/29312464&rfr_iscdi=true