Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease

Bibliographic details
Published in: Gigascience 2023-01, Vol.12
Main authors: Lee, Youngro; Cappellato, Marco; Di Camillo, Barbara
Format: Article
Language: English
Online access: Full text
container_title Gigascience
container_volume 12
creator Lee, Youngro
Cappellato, Marco
Di Camillo, Barbara
description Abstract
Background: Biomarker discovery that exploits the feature importance of machine learning models has recently gained ground in microbiome research, owing to its high predictive performance in several disease states. To select a concrete subset from a large number of features, recursive feature elimination (RFE) has been widely used in bioinformatics. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggest methods to improve stability while sustaining performance.
Results: We used abundance matrices of the gut microbiome (283 taxa at the species level and 220 at the genus level) to classify patients with inflammatory bowel disease (IBD) versus healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that improve stability further without sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the number of common features, and the ability to filter out noise features. We confirmed that mapping by the Bray–Curtis similarity matrix before RFE consistently improves stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when only a limited number of biomarkers was used as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
Conclusion: Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
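The abstract names two ingredients without giving formulas: a mapping by a Bray–Curtis similarity matrix applied before RFE, and a stability score computed across repeated feature selections. The following is a minimal Python sketch of both, not the authors' implementation. It assumes that the mapping multiplies the samples-by-taxa abundance matrix by a taxa-by-taxa similarity matrix, and that stability can be summarized by the mean pairwise Jaccard index of the selected feature sets. The function names `bray_curtis_similarity`, `map_by_feature_similarity`, and `mean_pairwise_jaccard` are hypothetical.

```python
from itertools import combinations


def bray_curtis_similarity(a, b):
    """Return 1 minus the Bray-Curtis dissimilarity of two non-negative vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 1.0  # convention chosen here: two all-zero vectors count as identical
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / total


def map_by_feature_similarity(X):
    """Assumed form of the mapping: X (samples x taxa) times S (taxa x taxa).

    S[i][j] is the Bray-Curtis similarity between the abundance profiles of
    taxon i and taxon j across all samples. The output keeps one column per
    taxon, so a feature selector run afterwards still ranks taxa.
    """
    columns = list(zip(*X))  # one abundance profile per taxon
    n = len(columns)
    S = [[bray_curtis_similarity(ci, cj) for cj in columns] for ci in columns]
    return [[sum(row[k] * S[k][j] for k in range(n)) for j in range(n)] for row in X]


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of selected-feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two feature selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

In a bootstrap setting such as the one the abstract describes, each resampled training set would yield one set of selected taxa, and `mean_pairwise_jaccard` would be applied to the collection of those sets. A value near 1 means the same taxa are selected regardless of the resample.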
doi_str_mv 10.1093/gigascience/giad083
format Article
fullrecord
recordtype article
sourceid proquest_pubme
sourcetype Open Access Repository
pqid 3130974013
pmid 37882604
publisher United States: Oxford University Press
rights The Author(s) 2023. Published by Oxford University Press GigaScience.
review status peer_reviewed
open access free_for_read
orcid 0000-0002-1693-7792 ; 0000-0001-8415-4688 ; 0000-0002-9483-5898
date 2023-01-01
linktopdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/pdf/
linktohtml https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/
backlink https://www.ncbi.nlm.nih.gov/pubmed/37882604 (View this record in MEDLINE/PubMed)
fulltext fulltext
identifier ISSN: 2047-217X
ispartof Gigascience, 2023-01, Vol.12
issn 2047-217X
2047-217X
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10600917
source Oxford Journals Open Access Collection; MEDLINE; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central
subjects Algorithms
Bioinformatics
Biomarkers
Comparative studies
Computational Biology - methods
Disease control
Feature selection
Humans
Inflammatory bowel disease
Inflammatory bowel diseases
Inflammatory Bowel Diseases - diagnosis
Intestinal microflora
Intestine
Learning algorithms
Machine Learning
Microbiomes
Microorganisms
Multilayer perceptrons
Performance evaluation
Similarity
Species classification
Stability
Transformations (mathematics)
title Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T22%3A40%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Machine%20learning%E2%80%93based%20feature%20selection%20to%20search%20stable%20microbial%20biomarkers:%20application%20to%20inflammatory%20bowel%20disease&rft.jtitle=Gigascience&rft.au=Lee,%20Youngro&rft.date=2023-01-01&rft.volume=12&rft.issn=2047-217X&rft.eissn=2047-217X&rft_id=info:doi/10.1093/gigascience/giad083&rft_dat=%3Cproquest_pubme%3E3130974013%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3130974013&rft_id=info:pmid/37882604&rft_oup_id=10.1093/gigascience/giad083&rfr_iscdi=true