Supervised machine learning for microbiomics: bridging the gap between current and best practices

Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards...

Full description

Bibliographic details
Published in: arXiv.org 2024-11
Main authors: Dudek, Natasha K; Chakhvadze, Mariam; Kobakhidze, Saba; Kantidze, Omar; Gankin, Yuriy
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Dudek, Natasha K
Chakhvadze, Mariam
Kobakhidze, Saba
Kantidze, Omar
Gankin, Yuriy
description Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.
doi_str_mv 10.48550/arxiv.2402.17621
format Article
fullrecord <record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2402_17621</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2932603319</sourcerecordid><originalsourceid>FETCH-LOGICAL-a951-30e48d78a7e3ca46bf67f859b3e62c248d899e9498eed2cd3643ffefa608aa233</originalsourceid><addsrcrecordid>eNotkFFLwzAUhYMgOOZ-gE8GfO5MkzRNfJOhThj44N7LbXqzZWxpTdup_95s8-nAPYfLOR8hdzmbS10U7BHijz_OuWR8npeK51dkwoXIMy05vyGzvt8xxrgqeVGICYHPscN49D029AB26wPSPUIMPmyoayM9eBvb2rdJ-ydaR99sTtawRbqBjtY4fCMGascYMQwUQpNu_UC7CHbwFvtbcu1g3-PsX6dk_fqyXiyz1cfb--J5lYEp8kwwlLopNZQoLEhVO1U6XZhaoOKWJ08bg0Yajdhw2wglhXPoQDENkAZOyf3l7Xl_1UV_gPhbnThUZw4p8XBJdLH9GlPHateOMaROFTeCK5YoGfEHV0Fh4A</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2932603319</pqid></control><display><type>article</type><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</creator><creatorcontrib>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</creatorcontrib><description>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. 
Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2402.17621</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Best practice ; Computer Science - Learning ; Design of experiments ; Machine learning ; Quantitative Biology - Genomics ; Reproducibility ; Supervised learning</subject><ispartof>arXiv.org, 2024-11</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,784,885,27925</link.rule.ids><backlink>$$Uhttps://doi.org/10.1016/j.mlwa.2024.100607$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2402.17621$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Dudek, Natasha K</creatorcontrib><creatorcontrib>Chakhvadze, Mariam</creatorcontrib><creatorcontrib>Kobakhidze, Saba</creatorcontrib><creatorcontrib>Kantidze, Omar</creatorcontrib><creatorcontrib>Gankin, Yuriy</creatorcontrib><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><title>arXiv.org</title><description>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. 
Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</description><subject>Best practice</subject><subject>Computer Science - Learning</subject><subject>Design of experiments</subject><subject>Machine learning</subject><subject>Quantitative Biology - Genomics</subject><subject>Reproducibility</subject><subject>Supervised learning</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotkFFLwzAUhYMgOOZ-gE8GfO5MkzRNfJOhThj44N7LbXqzZWxpTdup_95s8-nAPYfLOR8hdzmbS10U7BHijz_OuWR8npeK51dkwoXIMy05vyGzvt8xxrgqeVGICYHPscN49D029AB26wPSPUIMPmyoayM9eBvb2rdJ-ydaR99sTtawRbqBjtY4fCMGascYMQwUQpNu_UC7CHbwFvtbcu1g3-PsX6dk_fqyXiyz1cfb--J5lYEp8kwwlLopNZQoLEhVO1U6XZhaoOKWJ08bg0Yajdhw2wglhXPoQDENkAZOyf3l7Xl_1UV_gPhbnThUZw4p8XBJdLH9GlPHateOMaROFTeCK5YoGfEHV0Fh4A</recordid><startdate>20241117</startdate><enddate>20241117</enddate><creator>Dudek, Natasha K</creator><creator>Chakhvadze, 
Mariam</creator><creator>Kobakhidze, Saba</creator><creator>Kantidze, Omar</creator><creator>Gankin, Yuriy</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>ALC</scope><scope>GOX</scope></search><sort><creationdate>20241117</creationdate><title>Supervised machine learning for microbiomics: bridging the gap between current and best practices</title><author>Dudek, Natasha K ; Chakhvadze, Mariam ; Kobakhidze, Saba ; Kantidze, Omar ; Gankin, Yuriy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a951-30e48d78a7e3ca46bf67f859b3e62c248d899e9498eed2cd3643ffefa608aa233</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Best practice</topic><topic>Computer Science - Learning</topic><topic>Design of experiments</topic><topic>Machine learning</topic><topic>Quantitative Biology - Genomics</topic><topic>Reproducibility</topic><topic>Supervised learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Dudek, Natasha K</creatorcontrib><creatorcontrib>Chakhvadze, Mariam</creatorcontrib><creatorcontrib>Kobakhidze, Saba</creatorcontrib><creatorcontrib>Kantidze, Omar</creatorcontrib><creatorcontrib>Gankin, Yuriy</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central 
Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Dudek, Natasha K</au><au>Chakhvadze, Mariam</au><au>Kobakhidze, Saba</au><au>Kantidze, Omar</au><au>Gankin, Yuriy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Supervised machine learning for microbiomics: bridging the gap between current and best practices</atitle><jtitle>arXiv.org</jtitle><date>2024-11-17</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. 
Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2402.17621</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-11
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2402_17621
source arXiv.org; Free E- Journals
subjects Best practice
Computer Science - Learning
Design of experiments
Machine learning
Quantitative Biology - Genomics
Reproducibility
Supervised learning
title Supervised machine learning for microbiomics: bridging the gap between current and best practices
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T10%3A41%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Supervised%20machine%20learning%20for%20microbiomics:%20bridging%20the%20gap%20between%20current%20and%20best%20practices&rft.jtitle=arXiv.org&rft.au=Dudek,%20Natasha%20K&rft.date=2024-11-17&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2402.17621&rft_dat=%3Cproquest_arxiv%3E2932603319%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2932603319&rft_id=info:pmid/&rfr_iscdi=true