Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease
Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) h...
Published in: | Gigascience 2023-01, Vol.12 |
---|---|
Main authors: | Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | Gigascience |
container_volume | 12 |
creator | Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara |
description | Abstract
Background
Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance.
Results
We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
Conclusion
Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies. |
doi_str_mv | 10.1093/gigascience/giad083 |
format | Article |
fullrecord | <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10600917</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/gigascience/giad083</oup_id><sourcerecordid>3130974013</sourcerecordid><originalsourceid>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</originalsourceid><addsrcrecordid>eNqNUc1qFTEYDaLY0vYJBAm4cXNrfsZJxo1I8Q9aulFwF75kvrk3NZOMyYxS3PgOvqFPYsq9LVdXzSYn5JzDd75DyBPOTjnr5Iu1X0NxHqPDiqFnWj4gh4I1aiW4-vJwDx-Qk1KuWD1Kaa3kY3IgKxAtaw7JzwtwGx-RBoQcfVz_-fXbQsGeDgjzkpEWDOhmnyKdU31AdhtaZrAB6ehdTtZDoNanEfJXzOUVhWkK3sGtxMchwDjCnPI1tekHBtr76lPwmDwaIBQ82d1H5PO7t5_OPqzOL99_PHtzvnKNkvNKW92xDvuGST1IMQzCdhLaFlunWisle8n40LRcCN04aLUTrHWDUKrXNSIHeUReb32nxY7YO4xzhmCm7OvM1yaBN__-RL8x6_TdcNYy1nFVHZ7vHHL6tmCZzeiLwxAgYlqKEXWdUjRdJyv12X_Uq7TkWPMZySXrVMP4DUtuWXWBpWQc7qbhzNwUbPYKNruCq-rpfpA7zW2dlXC6JaRlupfjX3QLuW4</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3130974013</pqid></control><display><type>article</type><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><source>Oxford Journals Open Access Collection</source><source>MEDLINE</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><creator>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</creator><creatorcontrib>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</creatorcontrib><description>Abstract
Background
Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance.
Results
We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
Conclusion
Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</description><identifier>ISSN: 2047-217X</identifier><identifier>EISSN: 2047-217X</identifier><identifier>DOI: 10.1093/gigascience/giad083</identifier><identifier>PMID: 37882604</identifier><language>eng</language><publisher>United States: Oxford University Press</publisher><subject>Algorithms ; Bioinformatics ; Biomarkers ; Comparative studies ; Computational Biology - methods ; Disease control ; Feature selection ; Humans ; Inflammatory bowel disease ; Inflammatory bowel diseases ; Inflammatory Bowel Diseases - diagnosis ; Intestinal microflora ; Intestine ; Learning algorithms ; Machine Learning ; Microbiomes ; Microorganisms ; Multilayer perceptrons ; Performance evaluation ; Similarity ; Species classification ; Stability ; Transformations (mathematics)</subject><ispartof>Gigascience, 2023-01, Vol.12</ispartof><rights>The Author(s) 2023. Published by Oxford University Press GigaScience. 2023</rights><rights>The Author(s) 2023. 
Published by Oxford University Press GigaScience.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</citedby><cites>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</cites><orcidid>0000-0002-1693-7792 ; 0000-0001-8415-4688 ; 0000-0002-9483-5898</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,881,1598,27901,27902,53766,53768</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37882604$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Lee, Youngro</creatorcontrib><creatorcontrib>Cappellato, Marco</creatorcontrib><creatorcontrib>Di Camillo, Barbara</creatorcontrib><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><title>Gigascience</title><addtitle>Gigascience</addtitle><description>Abstract
Background
Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance.
Results
We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
Conclusion
Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</description><subject>Algorithms</subject><subject>Bioinformatics</subject><subject>Biomarkers</subject><subject>Comparative studies</subject><subject>Computational Biology - methods</subject><subject>Disease control</subject><subject>Feature selection</subject><subject>Humans</subject><subject>Inflammatory bowel disease</subject><subject>Inflammatory bowel diseases</subject><subject>Inflammatory Bowel Diseases - diagnosis</subject><subject>Intestinal microflora</subject><subject>Intestine</subject><subject>Learning algorithms</subject><subject>Machine Learning</subject><subject>Microbiomes</subject><subject>Microorganisms</subject><subject>Multilayer perceptrons</subject><subject>Performance evaluation</subject><subject>Similarity</subject><subject>Species classification</subject><subject>Stability</subject><subject>Transformations (mathematics)</subject><issn>2047-217X</issn><issn>2047-217X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNUc1qFTEYDaLY0vYJBAm4cXNrfsZJxo1I8Q9aulFwF75kvrk3NZOMyYxS3PgOvqFPYsq9LVdXzSYn5JzDd75DyBPOTjnr5Iu1X0NxHqPDiqFnWj4gh4I1aiW4-vJwDx-Qk1KuWD1Kaa3kY3IgKxAtaw7JzwtwGx-RBoQcfVz_-fXbQsGeDgjzkpEWDOhmnyKdU31AdhtaZrAB6ehdTtZDoNanEfJXzOUVhWkK3sGtxMchwDjCnPI1tekHBtr76lPwmDwaIBQ82d1H5PO7t5_OPqzOL99_PHtzvnKNkvNKW92xDvuGST1IMQzCdhLaFlunWisle8n40LRcCN04aLUTrHWDUKrXNSIHeUReb32nxY7YO4xzhmCm7OvM1yaBN__-RL8x6_TdcNYy1nFVHZ7vHHL6tmCZzeiLwxAgYlqKEXWdUjRdJyv12X_Uq7TkWPMZySXrVMP4DUtuWXWBpWQc7qbhzNwUbPYKNruCq-rpfpA7zW2dlXC6JaRlupfjX3QLuW4</recordid><startdate>20230101</startdate><enddate>20230101</enddate><creator>Lee, Youngro</creator><creator>Cappellato, Marco</creator><creator>Di Camillo, 
Barbara</creator><general>Oxford University Press</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>JQ2</scope><scope>K9.</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-1693-7792</orcidid><orcidid>https://orcid.org/0000-0001-8415-4688</orcidid><orcidid>https://orcid.org/0000-0002-9483-5898</orcidid></search><sort><creationdate>20230101</creationdate><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><author>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Bioinformatics</topic><topic>Biomarkers</topic><topic>Comparative studies</topic><topic>Computational Biology - methods</topic><topic>Disease control</topic><topic>Feature selection</topic><topic>Humans</topic><topic>Inflammatory bowel disease</topic><topic>Inflammatory bowel diseases</topic><topic>Inflammatory Bowel Diseases - diagnosis</topic><topic>Intestinal microflora</topic><topic>Intestine</topic><topic>Learning algorithms</topic><topic>Machine Learning</topic><topic>Microbiomes</topic><topic>Microorganisms</topic><topic>Multilayer perceptrons</topic><topic>Performance evaluation</topic><topic>Similarity</topic><topic>Species classification</topic><topic>Stability</topic><topic>Transformations (mathematics)</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lee, Youngro</creatorcontrib><creatorcontrib>Cappellato, Marco</creatorcontrib><creatorcontrib>Di Camillo, Barbara</creatorcontrib><collection>Oxford 
Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Gigascience</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lee, Youngro</au><au>Cappellato, Marco</au><au>Di Camillo, Barbara</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</atitle><jtitle>Gigascience</jtitle><addtitle>Gigascience</addtitle><date>2023-01-01</date><risdate>2023</risdate><volume>12</volume><issn>2047-217X</issn><eissn>2047-217X</eissn><abstract>Abstract
Background
Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance.
Results
We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
Conclusion
Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</abstract><cop>United States</cop><pub>Oxford University Press</pub><pmid>37882604</pmid><doi>10.1093/gigascience/giad083</doi><orcidid>https://orcid.org/0000-0002-1693-7792</orcidid><orcidid>https://orcid.org/0000-0001-8415-4688</orcidid><orcidid>https://orcid.org/0000-0002-9483-5898</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2047-217X |
ispartof | Gigascience, 2023-01, Vol.12 |
issn | 2047-217X 2047-217X |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10600917 |
source | Oxford Journals Open Access Collection; MEDLINE; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central |
subjects | Algorithms ; Bioinformatics ; Biomarkers ; Comparative studies ; Computational Biology - methods ; Disease control ; Feature selection ; Humans ; Inflammatory bowel disease ; Inflammatory bowel diseases ; Inflammatory Bowel Diseases - diagnosis ; Intestinal microflora ; Intestine ; Learning algorithms ; Machine Learning ; Microbiomes ; Microorganisms ; Multilayer perceptrons ; Performance evaluation ; Similarity ; Species classification ; Stability ; Transformations (mathematics) |
title | Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T22%3A40%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Machine%20learning%E2%80%93based%20feature%20selection%20to%20search%20stable%20microbial%20biomarkers:%20application%20to%20inflammatory%20bowel%20disease&rft.jtitle=Gigascience&rft.au=Lee,%20Youngro&rft.date=2023-01-01&rft.volume=12&rft.issn=2047-217X&rft.eissn=2047-217X&rft_id=info:doi/10.1093/gigascience/giad083&rft_dat=%3Cproquest_pubme%3E3130974013%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3130974013&rft_id=info:pmid/37882604&rft_oup_id=10.1093/gigascience/giad083&rfr_iscdi=true |
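
The abstract in this record mentions two ideas that a small example may make more concrete: mapping abundance data by a Bray–Curtis similarity matrix before feature elimination, and scoring how stable the selected features are across repeated runs. The sketch below is illustrative only and is not taken from the article. All function names are hypothetical, the mapping step (multiplying each sample profile by a taxon-by-taxon similarity matrix) is an assumed reading of "mapping by the Bray–Curtis similarity matrix", and the mean pairwise Jaccard index stands in for the unspecified "similarity metrics" used to evaluate stability.

```python
from itertools import combinations


def bray_curtis_similarity(u, v):
    """Return 1 minus the Bray-Curtis dissimilarity of two non-negative vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 1.0  # two all-zero profiles are treated as identical
    return 1.0 - sum(abs(a - b) for a, b in zip(u, v)) / total


def similarity_matrix(abundance):
    """Taxon-by-taxon similarity over the columns of a sample-by-taxon table."""
    columns = list(zip(*abundance))
    return [[bray_curtis_similarity(a, b) for b in columns] for a in columns]


def map_features(abundance, sim):
    """Assumed mapping step: multiply each sample profile by the similarity matrix."""
    sim_columns = list(zip(*sim))
    return [[sum(x * s for x, s in zip(row, col)) for col in sim_columns]
            for row in abundance]


def mean_jaccard(selections):
    """Average pairwise Jaccard index of the feature sets from repeated runs.

    Expects at least two non-empty sets, e.g. one per bootstrap sample.
    """
    pairs = list(combinations(selections, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy table: 3 samples (rows) x 3 taxa (columns); the values are made up.
    table = [[5, 4, 0], [3, 3, 1], [0, 1, 6]]
    mapped = map_features(table, similarity_matrix(table))
    print(mapped)
    # Feature sets that three hypothetical selection runs might have returned.
    print(mean_jaccard([{"t1", "t2"}, {"t1", "t2"}, {"t1", "t3"}]))
```

In a fuller pipeline the mapped table would be passed to a recursive feature elimination routine on each bootstrap sample, and the resulting feature sets would be compared with `mean_jaccard`; a value near 1 indicates that nearly the same taxa are chosen every time.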