Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease

Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) h...

Detailed description

Bibliographic details
Published in:	Gigascience 2023-01, Vol.12
Main authors:	Lee, Youngro; Cappellato, Marco; Di Camillo, Barbara
Format:	Article
Language:	eng
Subjects:	Algorithms; Bioinformatics; Biomarkers; Comparative studies; Computational Biology - methods; Disease control; Feature selection; Humans; Inflammatory bowel disease; Inflammatory bowel diseases; Inflammatory Bowel Diseases - diagnosis; Intestinal microflora; Intestine; Learning algorithms; Machine Learning; Microbiomes; Microorganisms; Multilayer perceptrons; Performance evaluation; Similarity; Species classification; Stability; Transformations (mathematics)
Online access:	Full text
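
The article's abstract refers to Bray–Curtis similarity and to judging feature-selection stability with similarity metrics across bootstrapped runs. The following is a minimal sketch of those two ideas, not the authors' pipeline: the function names, the toy abundance vectors, and the taxon labels are hypothetical, and mean pairwise Jaccard is only one plausible choice among the "various similarity metrics" the abstract mentions.

```python
from itertools import combinations


def bray_curtis_similarity(u, v):
    """Bray-Curtis similarity (1 - dissimilarity) of two non-negative abundance vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        # Convention chosen for this sketch: two empty profiles count as identical.
        return 1.0
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 2.0 * shared / total


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of selected-feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two feature selections to compare")
    # Two empty selections are skipped rather than scored.
    scores = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    if not scores:
        raise ValueError("all feature selections are empty")
    return sum(scores) / len(scores)


# Hypothetical abundance profiles of two samples over three taxa.
sample_1 = [1, 0, 3]
sample_2 = [1, 2, 1]

# Hypothetical feature sets chosen in three bootstrap runs.
bootstrap_selections = [
    {"taxon_a", "taxon_b", "taxon_c"},
    {"taxon_a", "taxon_b", "taxon_d"},
    {"taxon_a", "taxon_c", "taxon_d"},
]

if __name__ == "__main__":
    print(bray_curtis_similarity(sample_1, sample_2))
    print(mean_pairwise_jaccard(bootstrap_selections))
```

A stability score near 1 would mean that nearly the same features are selected in every run, while a score near 0 would mean the selections barely overlap.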

container_end_page
container_issue
container_start_page
container_title	Gigascience
container_volume	12
creator	Lee, Youngro; Cappellato, Marco; Di Camillo, Barbara
description	Abstract. Background: Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To obtain a concrete selection from a large number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. Results: We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. Conclusion: Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
doi_str_mv	10.1093/gigascience/giad083
format	Article
fullrecord	<record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10600917</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/gigascience/giad083</oup_id><sourcerecordid>3130974013</sourcerecordid><originalsourceid>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</originalsourceid><addsrcrecordid>eNqNUc1qFTEYDaLY0vYJBAm4cXNrfsZJxo1I8Q9aulFwF75kvrk3NZOMyYxS3PgOvqFPYsq9LVdXzSYn5JzDd75DyBPOTjnr5Iu1X0NxHqPDiqFnWj4gh4I1aiW4-vJwDx-Qk1KuWD1Kaa3kY3IgKxAtaw7JzwtwGx-RBoQcfVz_-fXbQsGeDgjzkpEWDOhmnyKdU31AdhtaZrAB6ehdTtZDoNanEfJXzOUVhWkK3sGtxMchwDjCnPI1tekHBtr76lPwmDwaIBQ82d1H5PO7t5_OPqzOL99_PHtzvnKNkvNKW92xDvuGST1IMQzCdhLaFlunWisle8n40LRcCN04aLUTrHWDUKrXNSIHeUReb32nxY7YO4xzhmCm7OvM1yaBN__-RL8x6_TdcNYy1nFVHZ7vHHL6tmCZzeiLwxAgYlqKEXWdUjRdJyv12X_Uq7TkWPMZySXrVMP4DUtuWXWBpWQc7qbhzNwUbPYKNruCq-rpfpA7zW2dlXC6JaRlupfjX3QLuW4</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3130974013</pqid></control><display><type>article</type><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><source>Oxford Journals Open Access Collection</source><source>MEDLINE</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><creator>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</creator><creatorcontrib>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</creatorcontrib><description>Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. 
However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. Results We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. 
Conclusion Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</description><identifier>ISSN: 2047-217X</identifier><identifier>EISSN: 2047-217X</identifier><identifier>DOI: 10.1093/gigascience/giad083</identifier><identifier>PMID: 37882604</identifier><language>eng</language><publisher>United States: Oxford University Press</publisher><subject>Algorithms ; Bioinformatics ; Biomarkers ; Comparative studies ; Computational Biology - methods ; Disease control ; Feature selection ; Humans ; Inflammatory bowel disease ; Inflammatory bowel diseases ; Inflammatory Bowel Diseases - diagnosis ; Intestinal microflora ; Intestine ; Learning algorithms ; Machine Learning ; Microbiomes ; Microorganisms ; Multilayer perceptrons ; Performance evaluation ; Similarity ; Species classification ; Stability ; Transformations (mathematics)</subject><ispartof>Gigascience, 2023-01, Vol.12</ispartof><rights>The Author(s) 2023. Published by Oxford University Press GigaScience. 2023</rights><rights>The Author(s) 2023. 
Published by Oxford University Press GigaScience.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</citedby><cites>FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</cites><orcidid>0000-0002-1693-7792 ; 0000-0001-8415-4688 ; 0000-0002-9483-5898</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,881,1598,27901,27902,53766,53768</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37882604$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Lee, Youngro</creatorcontrib><creatorcontrib>Cappellato, Marco</creatorcontrib><creatorcontrib>Di Camillo, Barbara</creatorcontrib><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><title>Gigascience</title><addtitle>Gigascience</addtitle><description>Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. 
Results We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. 
Conclusion Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</description><subject>Algorithms</subject><subject>Bioinformatics</subject><subject>Biomarkers</subject><subject>Comparative studies</subject><subject>Computational Biology - methods</subject><subject>Disease control</subject><subject>Feature selection</subject><subject>Humans</subject><subject>Inflammatory bowel disease</subject><subject>Inflammatory bowel diseases</subject><subject>Inflammatory Bowel Diseases - diagnosis</subject><subject>Intestinal microflora</subject><subject>Intestine</subject><subject>Learning algorithms</subject><subject>Machine Learning</subject><subject>Microbiomes</subject><subject>Microorganisms</subject><subject>Multilayer perceptrons</subject><subject>Performance evaluation</subject><subject>Similarity</subject><subject>Species classification</subject><subject>Stability</subject><subject>Transformations (mathematics)</subject><issn>2047-217X</issn><issn>2047-217X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNUc1qFTEYDaLY0vYJBAm4cXNrfsZJxo1I8Q9aulFwF75kvrk3NZOMyYxS3PgOvqFPYsq9LVdXzSYn5JzDd75DyBPOTjnr5Iu1X0NxHqPDiqFnWj4gh4I1aiW4-vJwDx-Qk1KuWD1Kaa3kY3IgKxAtaw7JzwtwGx-RBoQcfVz_-fXbQsGeDgjzkpEWDOhmnyKdU31AdhtaZrAB6ehdTtZDoNanEfJXzOUVhWkK3sGtxMchwDjCnPI1tekHBtr76lPwmDwaIBQ82d1H5PO7t5_OPqzOL99_PHtzvnKNkvNKW92xDvuGST1IMQzCdhLaFlunWisle8n40LRcCN04aLUTrHWDUKrXNSIHeUReb32nxY7YO4xzhmCm7OvM1yaBN__-RL8x6_TdcNYy1nFVHZ7vHHL6tmCZzeiLwxAgYlqKEXWdUjRdJyv12X_Uq7TkWPMZySXrVMP4DUtuWXWBpWQc7qbhzNwUbPYKNruCq-rpfpA7zW2dlXC6JaRlupfjX3QLuW4</recordid><startdate>20230101</startdate><enddate>20230101</enddate><creator>Lee, Youngro</creator><creator>Cappellato, Marco</creator><creator>Di Camillo, 
Barbara</creator><general>Oxford University Press</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>JQ2</scope><scope>K9.</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-1693-7792</orcidid><orcidid>https://orcid.org/0000-0001-8415-4688</orcidid><orcidid>https://orcid.org/0000-0002-9483-5898</orcidid></search><sort><creationdate>20230101</creationdate><title>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</title><author>Lee, Youngro ; Cappellato, Marco ; Di Camillo, Barbara</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c473t-8b8909ed4038f32ff2b93a66e6c76b330501f4612284ca68c206cf277d86041a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Bioinformatics</topic><topic>Biomarkers</topic><topic>Comparative studies</topic><topic>Computational Biology - methods</topic><topic>Disease control</topic><topic>Feature selection</topic><topic>Humans</topic><topic>Inflammatory bowel disease</topic><topic>Inflammatory bowel diseases</topic><topic>Inflammatory Bowel Diseases - diagnosis</topic><topic>Intestinal microflora</topic><topic>Intestine</topic><topic>Learning algorithms</topic><topic>Machine Learning</topic><topic>Microbiomes</topic><topic>Microorganisms</topic><topic>Multilayer perceptrons</topic><topic>Performance evaluation</topic><topic>Similarity</topic><topic>Species classification</topic><topic>Stability</topic><topic>Transformations (mathematics)</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lee, Youngro</creatorcontrib><creatorcontrib>Cappellato, Marco</creatorcontrib><creatorcontrib>Di Camillo, Barbara</creatorcontrib><collection>Oxford 
Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Gigascience</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lee, Youngro</au><au>Cappellato, Marco</au><au>Di Camillo, Barbara</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease</atitle><jtitle>Gigascience</jtitle><addtitle>Gigascience</addtitle><date>2023-01-01</date><risdate>2023</risdate><volume>12</volume><issn>2047-217X</issn><eissn>2047-217X</eissn><abstract>Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. Results We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. 
Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. Conclusion Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.</abstract><cop>United States</cop><pub>Oxford University Press</pub><pmid>37882604</pmid><doi>10.1093/gigascience/giad083</doi><orcidid>https://orcid.org/0000-0002-1693-7792</orcidid><orcidid>https://orcid.org/0000-0001-8415-4688</orcidid><orcidid>https://orcid.org/0000-0002-9483-5898</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2047-217X
ispartof	Gigascience, 2023-01, Vol.12
issn	2047-217X 2047-217X
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10600917
source	Oxford Journals Open Access Collection; MEDLINE; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central
subjects	Algorithms; Bioinformatics; Biomarkers; Comparative studies; Computational Biology - methods; Disease control; Feature selection; Humans; Inflammatory bowel disease; Inflammatory bowel diseases; Inflammatory Bowel Diseases - diagnosis; Intestinal microflora; Intestine; Learning algorithms; Machine Learning; Microbiomes; Microorganisms; Multilayer perceptrons; Performance evaluation; Similarity; Species classification; Stability; Transformations (mathematics)
title	Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T22%3A40%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Machine%20learning%E2%80%93based%20feature%20selection%20to%20search%20stable%20microbial%20biomarkers:%20application%20to%20inflammatory%20bowel%20disease&rft.jtitle=Gigascience&rft.au=Lee,%20Youngro&rft.date=2023-01-01&rft.volume=12&rft.issn=2047-217X&rft.eissn=2047-217X&rft_id=info:doi/10.1093/gigascience/giad083&rft_dat=%3Cproquest_pubme%3E3130974013%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3130974013&rft_id=info:pmid/37882604&rft_oup_id=10.1093/gigascience/giad083&rfr_iscdi=true
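
The url field above is an OpenURL link whose query string carries the citation as key/value pairs (`rft.*` keys for the referent, repeated `rft_id` keys for identifiers). As a rough sketch of how such a link could be unpacked, the snippet below uses only Python's standard `urllib.parse`. The function name `parse_openurl` and the shortened `EXAMPLE_URL` are illustrative assumptions; the example keeps only a subset of the parameters from the record's link.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened version of the record's resolver link (subset of its parameters).
EXAMPLE_URL = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article&rft.jtitle=Gigascience&rft.au=Lee,%20Youngro"
    "&rft.date=2023-01-01&rft.volume=12&rft.issn=2047-217X"
    "&rft_id=info:doi/10.1093/gigascience/giad083"
    "&rft_id=info:pmid/37882604"
)


def parse_openurl(url):
    """Return (referent fields, identifiers) extracted from an OpenURL query string."""
    # parse_qs percent-decodes values and maps each key to a list of values.
    params = parse_qs(urlsplit(url).query)

    # Keep the first value of every "rft.<name>" key, dropping the prefix.
    referent = {
        key[len("rft."):]: values[0]
        for key, values in params.items()
        if key.startswith("rft.")
    }

    # "rft_id" may repeat; values look like "info:<scheme>/<identifier>".
    identifiers = {}
    for raw in params.get("rft_id", []):
        if raw.startswith("info:"):
            scheme, _, value = raw[len("info:"):].partition("/")
            if value:  # skip entries with an empty identifier
                identifiers[scheme] = value
    return referent, identifiers


if __name__ == "__main__":
    fields, ids = parse_openurl(EXAMPLE_URL)
    print(fields)
    print(ids)
```

Splitting the identifier on the first `/` only keeps DOIs intact, since a DOI itself contains slashes.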