Inferring feature importance with uncertainties with application to large genotype data
Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generatin...
Published in: | PLoS computational biology 2023-03, Vol.19 (3), p.e1010963-e1010963 |
---|---|
Main authors: | Johnsen, Pål Vegard; Strümke, Inga; Langaas, Mette; DeWan, Andrew Thomas; Riemer-Sørensen, Signe |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | e1010963 |
---|---|
container_issue | 3 |
container_start_page | e1010963 |
container_title | PLoS computational biology |
container_volume | 19 |
creator | Johnsen, Pål Vegard; Strümke, Inga; Langaas, Mette; DeWan, Andrew Thomas; Riemer-Sørensen, Signe |
description | Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity. |
doi_str_mv | 10.1371/journal.pcbi.1010963 |
format | Article |
fullrecord | <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2802063893</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A744741131</galeid><doaj_id>oai_doaj_org_article_025e030ea1754b13914c99992fdd91dd</doaj_id><sourcerecordid>A744741131</sourcerecordid><originalsourceid>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</originalsourceid><addsrcrecordid>eNptkktv1DAQxyMEoqXwDRBE4gKHXfyIY_tUVRWPlSo4UImjNbGd1KusHWwH6LfHy6ZVF-GLx-Pf_OehqaqXGK0x5fj9NszRw7iedOfWGGEkW_qoOsWM0RWnTDx-YJ9Uz1LaIlRM2T6tTmgrMWcCn1bfN763MTo_1L2FPEdbu90UYgavbf3L5Zt6LlZ5O5-dTQcXTNPoNGQXfJ1DPUIcbD1YH_LtZGsDGZ5XT3oYk32x3GfV9ccP15efV1dfP20uL65WupSSV0IIzQnTGDMuUccI1kYySjoumOS25db0Xddi03NAYBsAyTQyRJCuEx3Qs-r1QXYaQ1LLSJIiAhHUUiFpITYHwgTYqim6HcRbFcCpv44QBwUxOz1ahQiziCILZTZNh6nEjZblkN4YiY0pWudLtrnbWaOtzxHGI9HjH-9u1BB-KlxGL4jgReHtohDDj9mmrHYuaTuO4G2YS-FctAIT0u7RN_-g_29voQYoHTjfh5JY70XVBW8a3mBMcaHeHVE6-Gx_5wHmlNTm25djtjmwOoaUou3v-8NI7Tfvrg613zy1bF4Je_VwNvdBd6tG_wBzZNXZ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2802063893</pqid></control><display><type>article</type><title>Inferring feature importance with uncertainties with application to large genotype data</title><source>Public Library of Science (PLoS) Journals Open Access</source><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><creator>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, Signe</creator><creatorcontrib>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, Signe</creatorcontrib><description>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. 
Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1010963</identifier><identifier>PMID: 36917581</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Artificial intelligence ; Biobanks ; Biology and Life Sciences ; Computer and Information Sciences ; Confidence intervals ; Decomposition ; Engineering and Technology ; Estimation theory ; Expected values ; Game theory ; Genotype ; Genotype & phenotype ; Genotypes ; Genotyping Techniques ; Machine learning ; Neural networks ; Obesity ; Physical Sciences ; Random variables ; Resampling ; Research and Analysis Methods ; Synthetic data ; Tree structures (Computers) ; Uncertainty ; Values</subject><ispartof>PLoS computational biology, 2023-03, Vol.19 (3), p.e1010963-e1010963</ispartof><rights>Copyright: © 2023 Johnsen et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</rights><rights>COPYRIGHT 2023 Public Library of Science</rights><rights>2023 Johnsen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2023 Johnsen et al 2023 Johnsen et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</citedby><cites>FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</cites><orcidid>0000-0002-5308-7651 ; 0000-0003-1820-6544 ; 0000-0002-2599-7914 ; 0000-0002-5714-0288</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038287/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79569,79570</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36917581$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Johnsen, Pål Vegard</creatorcontrib><creatorcontrib>Strümke, 
Inga</creatorcontrib><creatorcontrib>Langaas, Mette</creatorcontrib><creatorcontrib>DeWan, Andrew Thomas</creatorcontrib><creatorcontrib>Riemer-Sørensen, Signe</creatorcontrib><title>Inferring feature importance with uncertainties with application to large genotype data</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. 
The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</description><subject>Artificial intelligence</subject><subject>Biobanks</subject><subject>Biology and Life Sciences</subject><subject>Computer and Information Sciences</subject><subject>Confidence intervals</subject><subject>Decomposition</subject><subject>Engineering and Technology</subject><subject>Estimation theory</subject><subject>Expected values</subject><subject>Game theory</subject><subject>Genotype</subject><subject>Genotype & phenotype</subject><subject>Genotypes</subject><subject>Genotyping Techniques</subject><subject>Machine learning</subject><subject>Neural networks</subject><subject>Obesity</subject><subject>Physical Sciences</subject><subject>Random variables</subject><subject>Resampling</subject><subject>Research and Analysis Methods</subject><subject>Synthetic data</subject><subject>Tree structures (Computers)</subject><subject>Uncertainty</subject><subject>Values</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNptkktv1DAQxyMEoqXwDRBE4gKHXfyIY_tUVRWPlSo4UImjNbGd1KusHWwH6LfHy6ZVF-GLx-Pf_OehqaqXGK0x5fj9NszRw7iedOfWGGEkW_qoOsWM0RWnTDx-YJ9Uz1LaIlRM2T6tTmgrMWcCn1bfN763MTo_1L2FPEdbu90UYgavbf3L5Zt6LlZ5O5-dTQcXTNPoNGQXfJ1DPUIcbD1YH_LtZGsDGZ5XT3oYk32x3GfV9ccP15efV1dfP20uL65WupSSV0IIzQnTGDMuUccI1kYySjoumOS25db0Xddi03NAYBsAyTQyRJCuEx3Qs-r1QXYaQ1LLSJIiAhHUUiFpITYHwgTYqim6HcRbFcCpv44QBwUxOz1ahQiziCILZTZNh6nEjZblkN4YiY0pWudLtrnbWaOtzxHGI9HjH-9u1BB-KlxGL4jgReHtohDDj9mmrHYuaTuO4G2YS-FctAIT0u7RN_-g_29voQYoHTjfh5JY70XVBW8a3mBMcaHeHVE6-Gx_5wHmlNTm25djtjmwOoaUou3v-8NI7Tfvrg613zy1bF4Je_VwNvdBd6tG_wBzZNXZ</recordid><startdate>20230301</startdate><enddate>20230301</enddate><creator>Johnsen, Pål 
Vegard</creator><creator>Strümke, Inga</creator><creator>Langaas, Mette</creator><creator>DeWan, Andrew Thomas</creator><creator>Riemer-Sørensen, Signe</creator><general>Public Library of Science</general><general>Public Library of Science (PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PHGZM</scope><scope>PHGZT</scope><scope>PIMPY</scope><scope>PJZUB</scope><scope>PKEHL</scope><scope>PPXIY</scope><scope>PQEST</scope><scope>PQGLB</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-5308-7651</orcidid><orcidid>https://orcid.org/0000-0003-1820-6544</orcidid><orcidid>https://orcid.org/0000-0002-2599-7914</orcidid><orcidid>https://orcid.org/0000-0002-5714-0288</orcidid></search><sort><creationdate>20230301</creationdate><title>Inferring feature importance with uncertainties with application to large genotype data</title><author>Johnsen, Pål Vegard ; Strümke, Inga ; Langaas, Mette ; DeWan, Andrew Thomas ; Riemer-Sørensen, 
Signe</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c589t-888c725c115790b521cd9532b78597e67edfbb61df7a0ae4aa95c0d282bb8ba3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial intelligence</topic><topic>Biobanks</topic><topic>Biology and Life Sciences</topic><topic>Computer and Information Sciences</topic><topic>Confidence intervals</topic><topic>Decomposition</topic><topic>Engineering and Technology</topic><topic>Estimation theory</topic><topic>Expected values</topic><topic>Game theory</topic><topic>Genotype</topic><topic>Genotype & phenotype</topic><topic>Genotypes</topic><topic>Genotyping Techniques</topic><topic>Machine learning</topic><topic>Neural networks</topic><topic>Obesity</topic><topic>Physical Sciences</topic><topic>Random variables</topic><topic>Resampling</topic><topic>Research and Analysis Methods</topic><topic>Synthetic data</topic><topic>Tree structures (Computers)</topic><topic>Uncertainty</topic><topic>Values</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Johnsen, Pål Vegard</creatorcontrib><creatorcontrib>Strümke, Inga</creatorcontrib><creatorcontrib>Langaas, Mette</creatorcontrib><creatorcontrib>DeWan, Andrew Thomas</creatorcontrib><creatorcontrib>Riemer-Sørensen, Signe</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium & Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Health & Medical 
Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Computing Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies & Aerospace 
Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>ProQuest Central (New)</collection><collection>ProQuest One Academic (New)</collection><collection>Publicly Available Content Database</collection><collection>ProQuest Health & Medical Research Collection</collection><collection>ProQuest One Academic Middle East (New)</collection><collection>ProQuest One Health & Nursing</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Applied & Life Sciences</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Johnsen, Pål Vegard</au><au>Strümke, Inga</au><au>Langaas, Mette</au><au>DeWan, Andrew Thomas</au><au>Riemer-Sørensen, Signe</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Inferring feature importance with uncertainties with application to large genotype data</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2023-03-01</date><risdate>2023</risdate><volume>19</volume><issue>3</issue><spage>e1010963</spage><epage>e1010963</epage><pages>e1010963-e1010963</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Estimating feature importance, which is the contribution of a prediction or several predictions due to a feature, is an essential aspect of explaining data-based models. 
Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score of SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, it has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in our Sub-SAGE estimator can be estimated using bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as large genotype data for predicting feature importance with respect to obesity.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>36917581</pmid><doi>10.1371/journal.pcbi.1010963</doi><tpages>e1010963</tpages><orcidid>https://orcid.org/0000-0002-5308-7651</orcidid><orcidid>https://orcid.org/0000-0003-1820-6544</orcidid><orcidid>https://orcid.org/0000-0002-2599-7914</orcidid><orcidid>https://orcid.org/0000-0002-5714-0288</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1553-7358 |
ispartof | PLoS computational biology, 2023-03, Vol.19 (3), p.e1010963-e1010963 |
issn | 1553-7358; 1553-734X |
language | eng |
recordid | cdi_plos_journals_2802063893 |
source | Public Library of Science (PLoS) Journals Open Access; MEDLINE; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central |
subjects | Artificial intelligence; Biobanks; Biology and Life Sciences; Computer and Information Sciences; Confidence intervals; Decomposition; Engineering and Technology; Estimation theory; Expected values; Game theory; Genotype; Genotype & phenotype; Genotypes; Genotyping Techniques; Machine learning; Neural networks; Obesity; Physical Sciences; Random variables; Resampling; Research and Analysis Methods; Synthetic data; Tree structures (Computers); Uncertainty; Values |
title | Inferring feature importance with uncertainties with application to large genotype data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T17%3A00%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Inferring%20feature%20importance%20with%20uncertainties%20with%20application%20to%20large%20genotype%20data&rft.jtitle=PLoS%20computational%20biology&rft.au=Johnsen,%20P%C3%A5l%20Vegard&rft.date=2023-03-01&rft.volume=19&rft.issue=3&rft.spage=e1010963&rft.epage=e1010963&rft.pages=e1010963-e1010963&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1010963&rft_dat=%3Cgale_plos_%3EA744741131%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2802063893&rft_id=info:pmid/36917581&rft_galeid=A744741131&rft_doaj_id=oai_doaj_org_article_025e030ea1754b13914c99992fdd91dd&rfr_iscdi=true |
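
The description field mentions Shapley-value-based feature importance with bootstrapped uncertainties. The following is a minimal sketch of that general idea only. It is not the authors' SAGE or Sub-SAGE estimator. The toy `model`, the mean-imputation `value` function, and all helper names are assumptions introduced here for illustration.

```python
import itertools
import math
import random


def model(x):
    # Hypothetical fixed "fitted" model: it uses features 0 and 1 and ignores feature 2.
    return 2.0 * x[0] + 1.0 * x[1]


def value(subset, rows, means):
    """Negative mean squared error when only the features in `subset` are known.

    Unknown features are replaced by their sample mean. This is a crude
    simplification chosen for brevity, not the conditional expectation
    a real Shapley-based importance score would normally use.
    """
    total = 0.0
    for x, y in rows:
        masked = [x[j] if j in subset else means[j] for j in range(len(x))]
        total += (y - model(masked)) ** 2
    return -total / len(rows)


def shapley_importance(rows, n_features):
    """Exact Shapley values of the loss-reduction game (feasible only for few features)."""
    means = [sum(x[j] for x, _ in rows) / len(rows) for j in range(n_features)]
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (M - |S| - 1)! / M! for coalitions of this size.
            weight = (math.factorial(size) * math.factorial(n_features - size - 1)
                      / math.factorial(n_features))
            for combo in itertools.combinations(others, size):
                s = set(combo)
                phi[j] += weight * (value(s | {j}, rows, means) - value(s, rows, means))
    return phi


def bootstrap_intervals(rows, n_features, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap intervals: resample rows with replacement, recompute scores."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(len(rows))] for _ in rows]
        draws.append(shapley_importance(sample, n_features))
    intervals = []
    for j in range(n_features):
        vals = sorted(d[j] for d in draws)
        lower = vals[int((alpha / 2) * n_boot)]
        upper = vals[int((1 - alpha / 2) * n_boot) - 1]
        intervals.append((lower, upper))
    return intervals


# Synthetic data: y depends on features 0 and 1 plus noise; feature 2 is irrelevant.
data_rng = random.Random(1)
rows = []
for _ in range(300):
    x = [data_rng.gauss(0, 1) for _ in range(3)]
    y = 2.0 * x[0] + 1.0 * x[1] + data_rng.gauss(0, 0.5)
    rows.append((x, y))

phi = shapley_importance(rows, 3)
ci = bootstrap_intervals(rows, 3)
for j in range(3):
    print(f"feature {j}: importance={phi[j]:.3f}, 95% interval=({ci[j][0]:.3f}, {ci[j][1]:.3f})")
```

In this sketch the model is held fixed and only the evaluation rows are resampled, so the intervals reflect sampling variability in the importance estimate, not variability from refitting the model. Exact enumeration over feature subsets scales exponentially and is used here only because the toy problem has three features.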