Inferring feature importance with uncertainties with application to large genotype data

Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generatin...


Bibliographic Details
Published in: PLoS computational biology 2023-03, Vol.19 (3), p.e1010963-e1010963
Main authors: Johnsen, Pål Vegard, Strümke, Inga, Langaas, Mette, DeWan, Andrew Thomas, Riemer-Sørensen, Signe
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page e1010963
container_issue 3
container_start_page e1010963
container_title PLoS computational biology
container_volume 19
creator Johnsen, Pål Vegard
Strümke, Inga
Langaas, Mette
DeWan, Andrew Thomas
Riemer-Sørensen, Signe
description Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.
doi_str_mv 10.1371/journal.pcbi.1010963
format Article
fullrecord <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2802063893</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A744741131</galeid><doaj_id>oai_doaj_org_article_025e030ea1754b13914c99992fdd91dd</doaj_id><sourcerecordid>A744741131</sourcerecordid><originalsourceid>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</originalsourceid><addsrcrecordid>eNptkktv1DAQxyMEoqXwDRBE4gKHXfyIY_tUVRWPlSo4UImjNbGd1KusHWwH6LfHy6ZVF-GLx-Pf_OehqaqXGK0x5fj9NszRw7iedOfWGGEkW_qoOsWM0RWnTDx-YJ9Uz1LaIlRM2T6tTmgrMWcCn1bfN763MTo_1L2FPEdbu90UYgavbf3L5Zt6LlZ5O5-dTQcXTNPoNGQXfJ1DPUIcbD1YH_LtZGsDGZ5XT3oYk32x3GfV9ccP15efV1dfP20uL65WupSSV0IIzQnTGDMuUccI1kYySjoumOS25db0Xddi03NAYBsAyTQyRJCuEx3Qs-r1QXYaQ1LLSJIiAhHUUiFpITYHwgTYqim6HcRbFcCpv44QBwUxOz1ahQiziCILZTZNh6nEjZblkN4YiY0pWudLtrnbWaOtzxHGI9HjH-9u1BB-KlxGL4jgReHtohDDj9mmrHYuaTuO4G2YS-FctAIT0u7RN_-g_29voQYoHTjfh5JY70XVBW8a3mBMcaHeHVE6-Gx_5wHmlNTm25djtjmwOoaUou3v-8NI7Tfvrg613zy1bF4Je_VwNvdBd6tG_wBzZNXZ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2802063893</pqid></control><display><type>article</type><title>Inferring feature importance with uncertainties with application to large genotype data</title><source>Public Library of Science (PLoS) Journals Open Access</source><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><creator>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, Signe</creator><creatorcontrib>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, Signe</creatorcontrib><description>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. 
Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1010963</identifier><identifier>PMID: 36917581</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Artificial intelligence ; Biobanks ; Biology and Life Sciences ; Computer and Information Sciences ; Confidence intervals ; Decomposition ; Engineering and Technology ; Estimation theory ; Expected values ; Game theory ; Genotype ; Genotype &amp; phenotype ; Genotypes ; Genotyping Techniques ; Machine learning ; Neural networks ; Obesity ; Physical Sciences ; Random variables ; Resampling ; Research and Analysis Methods ; Synthetic data ; Tree structures (Computers) ; Uncertainty ; Values</subject><ispartof>PLoS computational biology, 2023-03, Vol.19 (3), p.e1010963-e1010963</ispartof><rights>Copyright: © 2023 Johnsen et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</rights><rights>COPYRIGHT 2023 Public Library of Science</rights><rights>2023 Johnsen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2023 Johnsen et al 2023 Johnsen et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</citedby><cites>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</cites><orcidid>0000-0002-5308-7651 ; 0000-0003-1820-6544 ; 0000-0002-2599-7914 ; 0000-0002-5714-0288</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79569,79570</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36917581$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Johnsen, Pål Vegard</creatorcontrib><creatorcontrib>Strümke, 
Inga</creatorcontrib><creatorcontrib>Langaas, Mette</creatorcontrib><creatorcontrib>DeWan, Andrew Thomas</creatorcontrib><creatorcontrib>Riemer-Sørensen, Signe</creatorcontrib><title>Inferring feature importance with uncertainties with application to large genotype data</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. 
The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</description><subject>Artificial intelligence</subject><subject>Biobanks</subject><subject>Biology and Life Sciences</subject><subject>Computer and Information Sciences</subject><subject>Confidence intervals</subject><subject>Decomposition</subject><subject>Engineering and Technology</subject><subject>Estimation theory</subject><subject>Expected values</subject><subject>Game theory</subject><subject>Genotype</subject><subject>Genotype &amp; phenotype</subject><subject>Genotypes</subject><subject>Genotyping Techniques</subject><subject>Machine learning</subject><subject>Neural networks</subject><subject>Obesity</subject><subject>Physical Sciences</subject><subject>Random variables</subject><subject>Resampling</subject><subject>Research and Analysis Methods</subject><subject>Synthetic data</subject><subject>Tree structures (Computers)</subject><subject>Uncertainty</subject><subject>Values</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNptkktv1DAQxyMEoqXwDRBE4gKHXfyIY_tUVRWPlSo4UImjNbGd1KusHWwH6LfHy6ZVF-GLx-Pf_OehqaqXGK0x5fj9NszRw7iedOfWGGEkW_qoOsWM0RWnTDx-YJ9Uz1LaIlRM2T6tTmgrMWcCn1bfN763MTo_1L2FPEdbu90UYgavbf3L5Zt6LlZ5O5-dTQcXTNPoNGQXfJ1DPUIcbD1YH_LtZGsDGZ5XT3oYk32x3GfV9ccP15efV1dfP20uL65WupSSV0IIzQnTGDMuUccI1kYySjoumOS25db0Xddi03NAYBsAyTQyRJCuEx3Qs-r1QXYaQ1LLSJIiAhHUUiFpITYHwgTYqim6HcRbFcCpv44QBwUxOz1ahQiziCILZTZNh6nEjZblkN4YiY0pWudLtrnbWaOtzxHGI9HjH-9u1BB-KlxGL4jgReHtohDDj9mmrHYuaTuO4G2YS-FctAIT0u7RN_-g_29voQYoHTjfh5JY70XVBW8a3mBMcaHeHVE6-Gx_5wHmlNTm25djtjmwOoaUou3v-8NI7Tfvrg613zy1bF4Je_VwNvdBd6tG_wBzZNXZ</recordid><startdate>20230301</startdate><enddate>20230301</enddate><creator>Johnsen, Pål 
Vegard</creator><creator>Strümke, Inga</creator><creator>Langaas, Mette</creator><creator>DeWan, Andrew Thomas</creator><creator>Riemer-Sørensen, Signe</creator><general>Public Library of Science</general><general>Public Library of Science (PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PHGZM</scope><scope>PHGZT</scope><scope>PIMPY</scope><scope>PJZUB</scope><scope>PKEHL</scope><scope>PPXIY</scope><scope>PQEST</scope><scope>PQGLB</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-5308-7651</orcidid><orcidid>https://orcid.org/0000-0003-1820-6544</orcidid><orcidid>https://orcid.org/0000-0002-2599-7914</orcidid><orcidid>https://orcid.org/0000-0002-5714-0288</orcidid></search><sort><creationdate>20230301</creationdate><title>Inferring feature importance with uncertainties with application to large genotype data</title><author>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, 
Signe</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial intelligence</topic><topic>Biobanks</topic><topic>Biology and Life Sciences</topic><topic>Computer and Information Sciences</topic><topic>Confidence intervals</topic><topic>Decomposition</topic><topic>Engineering and Technology</topic><topic>Estimation theory</topic><topic>Expected values</topic><topic>Game theory</topic><topic>Genotype</topic><topic>Genotype &amp; phenotype</topic><topic>Genotypes</topic><topic>Genotyping Techniques</topic><topic>Machine learning</topic><topic>Neural networks</topic><topic>Obesity</topic><topic>Physical Sciences</topic><topic>Random variables</topic><topic>Resampling</topic><topic>Research and Analysis Methods</topic><topic>Synthetic data</topic><topic>Tree structures (Computers)</topic><topic>Uncertainty</topic><topic>Values</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Johnsen, Pål Vegard</creatorcontrib><creatorcontrib>Strümke, Inga</creatorcontrib><creatorcontrib>Langaas, Mette</creatorcontrib><creatorcontrib>DeWan, Andrew Thomas</creatorcontrib><creatorcontrib>Riemer-Sørensen, Signe</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium &amp; Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Health &amp; Medical 
Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Computing Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies &amp; Aerospace 
Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>ProQuest Central (New)</collection><collection>ProQuest One Academic (New)</collection><collection>Publicly Available Content Database</collection><collection>ProQuest Health &amp; Medical Research Collection</collection><collection>ProQuest One Academic Middle East (New)</collection><collection>ProQuest One Health &amp; Nursing</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Applied &amp; Life Sciences</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Johnsen, Pål Vegard</au><au>Strümke, Inga</au><au>Langaas, Mette</au><au>DeWan, Andrew Thomas</au><au>Riemer-Sørensen, Signe</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Inferring feature importance with uncertainties with application to large genotype data</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2023-03-01</date><risdate>2023</risdate><volume>19</volume><issue>3</issue><spage>e1010963</spage><epage>e1010963</epage><pages>e1010963-e1010963</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. 
Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>36917581</pmid><doi>10.1371/journal.pcbi.1010963</doi><tpages>e1010963</tpages><orcidid>https://orcid.org/0000-0002-5308-7651</orcidid><orcidid>https://orcid.org/0000-0003-1820-6544</orcidid><orcidid>https://orcid.org/0000-0002-2599-7914</orcidid><orcidid>https://orcid.org/0000-0002-5714-0288</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1553-7358
ispartof PLoS computational biology, 2023-03, Vol.19 (3), p.e1010963-e1010963
issn 1553-7358
1553-734X
1553-7358
language eng
recordid cdi_plos_journals_2802063893
source Public Library of Science (PLoS) Journals Open Access; MEDLINE; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central
subjects Artificial intelligence
Biobanks
Biology and Life Sciences
Computer and Information Sciences
Confidence intervals
Decomposition
Engineering and Technology
Estimation theory
Expected values
Game theory
Genotype
Genotype & phenotype
Genotypes
Genotyping Techniques
Machine learning
Neural networks
Obesity
Physical Sciences
Random variables
Resampling
Research and Analysis Methods
Synthetic data
Tree structures (Computers)
Uncertainty
Values
title Inferring feature importance with uncertainties with application to large genotype data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T17%3A00%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Inferring%20feature%20importance%20with%20uncertainties%20with%20application%20to%20large%20genotype%20data&rft.jtitle=PLoS%20computational%20biology&rft.au=Johnsen,%20P%C3%A5l%20Vegard&rft.date=2023-03-01&rft.volume=19&rft.issue=3&rft.spage=e1010963&rft.epage=e1010963&rft.pages=e1010963-e1010963&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1010963&rft_dat=%3Cgale_plos_%3EA744741131%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2802063893&rft_id=info:pmid/36917581&rft_galeid=A744741131&rft_doaj_id=oai_doaj_org_article_025e030ea1754b13914c99992fdd91dd&rfr_iscdi=true