Supervised machine learning for microbiomics: bridging the gap between current and best practices

Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards...

Bibliographic Details
Published in:	arXiv.org 2024-11
Main authors:	Dudek, Natasha K; Chakhvadze, Mariam; Kobakhidze, Saba; Kantidze, Omar; Gankin, Yuriy
Format:	Article
Language:	eng
Subjects:	Best practice; Computer Science - Learning; Design of experiments; Machine learning; Quantitative Biology - Genomics; Reproducibility; Supervised learning
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Dudek, Natasha K; Chakhvadze, Mariam; Kobakhidze, Saba; Kantidze, Omar; Gankin, Yuriy
description	Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.
doi_str_mv	10.48550/arxiv.2402.17621
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2402_17621</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2932603319</sourcerecordid><originalsourceid>FETCH-LOGICAL-a951-30e48d78a7e3ca46bf67f859b3e62c248d899e9498eed2cd3643ffefa608aa233</originalsourceid><addsrcrecordid>eNotkFFLwzAUhYMgOOZ-gE8GfO5MkzRNfJOhThj44N7LbXqzZWxpTdup_95s8-nAPYfLOR8hdzmbS10U7BHijz_OuWR8npeK51dkwoXIMy05vyGzvt8xxrgqeVGICYHPscN49D029AB26wPSPUIMPmyoayM9eBvb2rdJ-ydaR99sTtawRbqBjtY4fCMGascYMQwUQpNu_UC7CHbwFvtbcu1g3-PsX6dk_fqyXiyz1cfb--J5lYEp8kwwlLopNZQoLEhVO1U6XZhaoOKWJ08bg0Yajdhw2wglhXPoQDENkAZOyf3l7Xl_1UV_gPhbnThUZw4p8XBJdLH9GlPHateOMaROFTeCK5YoGfEHV0Fh4A</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2932603319</pqid></control><display><type>article</type><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</creator><creatorcontrib>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</creatorcontrib><description>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. 
Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2402.17621</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Best practice ; Computer Science - Learning ; Design of experiments ; Machine learning ; Quantitative Biology - Genomics ; Reproducibility ; Supervised learning</subject><ispartof>arXiv.org, 2024-11</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,784,885,27925</link.rule.ids><backlink>$$Uhttps://doi.org/10.1016/j.mlwa.2024.100607$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2402.17621$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Dudek, Natasha K</creatorcontrib><creatorcontrib>Chakhvadze, Mariam</creatorcontrib><creatorcontrib>Kobakhidze, Saba</creatorcontrib><creatorcontrib>Kantidze, Omar</creatorcontrib><creatorcontrib>Gankin, Yuriy</creatorcontrib><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><title>arXiv.org</title><description>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. 
Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</description><subject>Best practice</subject><subject>Computer Science - Learning</subject><subject>Design of experiments</subject><subject>Machine learning</subject><subject>Quantitative Biology - Genomics</subject><subject>Reproducibility</subject><subject>Supervised learning</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotkFFLwzAUhYMgOOZ-gE8GfO5MkzRNfJOhThj44N7LbXqzZWxpTdup_95s8-nAPYfLOR8hdzmbS10U7BHijz_OuWR8npeK51dkwoXIMy05vyGzvt8xxrgqeVGICYHPscN49D029AB26wPSPUIMPmyoayM9eBvb2rdJ-ydaR99sTtawRbqBjtY4fCMGascYMQwUQpNu_UC7CHbwFvtbcu1g3-PsX6dk_fqyXiyz1cfb--J5lYEp8kwwlLopNZQoLEhVO1U6XZhaoOKWJ08bg0Yajdhw2wglhXPoQDENkAZOyf3l7Xl_1UV_gPhbnThUZw4p8XBJdLH9GlPHateOMaROFTeCK5YoGfEHV0Fh4A</recordid><startdate>20241117</startdate><enddate>20241117</enddate><creator>Dudek, Natasha K</creator><creator>Chakhvadze, 
Mariam</creator><creator>Kobakhidze, Saba</creator><creator>Kantidze, Omar</creator><creator>Gankin, Yuriy</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>ALC</scope><scope>GOX</scope></search><sort><creationdate>20241117</creationdate><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><author>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a951-30e48d78a7e3ca46bf67f859b3e62c248d899e9498eed2cd3643ffefa608aa233</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Best practice</topic><topic>Computer Science - Learning</topic><topic>Design of experiments</topic><topic>Machine learning</topic><topic>Quantitative Biology - Genomics</topic><topic>Reproducibility</topic><topic>Supervised learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Dudek, Natasha K</creatorcontrib><creatorcontrib>Chakhvadze, Mariam</creatorcontrib><creatorcontrib>Kobakhidze, Saba</creatorcontrib><creatorcontrib>Kantidze, Omar</creatorcontrib><creatorcontrib>Gankin, Yuriy</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central 
Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Dudek, Natasha K</au><au>Chakhvadze, Mariam</au><au>Kobakhidze, Saba</au><au>Kantidze, Omar</au><au>Gankin, Yuriy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Supervised machine learning for microbiomics: bridging the gap between current and best practices</atitle><jtitle>arXiv.org</jtitle><date>2024-11-17</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. 
Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2402.17621</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-11
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2402_17621
source	arXiv.org; Free E- Journals
subjects	Best practice; Computer Science - Learning; Design of experiments; Machine learning; Quantitative Biology - Genomics; Reproducibility; Supervised learning
title	Supervised machine learning for microbiomics: bridging the gap between current and best practices
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T10%3A41%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Supervised%20machine%20learning%20for%20microbiomics:%20bridging%20the%20gap%20between%20current%20and%20best%20practices&rft.jtitle=arXiv.org&rft.au=Dudek,%20Natasha%20K&rft.date=2024-11-17&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2402.17621&rft_dat=%3Cproquest_arxiv%3E2932603319%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2932603319&rft_id=info:pmid/&rfr_iscdi=true