Supervised machine learning for microbiomics: bridging the gap between current and best practices
Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards...
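Published in: | arXiv.org 2024-11 |
---|---|
Main authors: | Dudek, Natasha K; Chakhvadze, Mariam; Kobakhidze, Saba; Kantidze, Omar; Gankin, Yuriy |
Format: | Article |
Language: | eng |
Subjects: | Best practice; Computer Science - Learning; Design of experiments; Machine learning; Quantitative Biology - Genomics; Reproducibility; Supervised learning |
Online access: | Full text |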
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Dudek, Natasha K; Chakhvadze, Mariam; Kobakhidze, Saba; Kantidze, Omar; Gankin, Yuriy |
description | Machine learning (ML) is poised to drive innovations in clinical microbiomics, such as in disease diagnostics and prognostics. However, the successful implementation of ML in these domains necessitates the development of reproducible, interpretable models that meet the rigorous performance standards set by regulatory agencies. This study aims to identify key areas in need of improvement in current ML practices within microbiomics, with a focus on bridging the gap between existing methodologies and the requirements for clinical application. To do so, we analyze 100 peer-reviewed articles from 2021-2022. Within this corpus, datasets have a median size of 161.5 samples, with over one-third containing fewer than 100 samples, signaling a high potential for overfitting. Limited demographic data further raises concerns about generalizability and fairness, with 24% of studies omitting participants' country of residence, and attributes like race/ethnicity, education, and income rarely reported (11%, 2%, and 0%, respectively). Methodological issues are also common; for instance, for 86% of studies we could not confidently rule out test set omission and data leakage, suggesting a strong potential for inflated performance estimates across the literature. Reproducibility is also a concern, with 78% of studies abstaining from sharing their ML code publicly. Based on this analysis, we provide guidance to avoid common pitfalls that can hinder model performance, generalizability, and trustworthiness. An interactive tutorial on applying ML to microbiomics data accompanies the discussion, to help establish and reinforce best practices within the community. |
doi_str_mv | 10.48550/arxiv.2402.17621 |
format | Article |
date | 2024-11-17 |
publisher | Ithaca: Cornell University Library, arXiv.org |
rights | 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
backlink | https://doi.org/10.1016/j.mlwa.2024.100607 (published paper; access to full text may be restricted); https://doi.org/10.48550/arXiv.2402.17621 (paper in arXiv) |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_arxiv_primary_2402_17621 |
source | arXiv.org; Free E- Journals |
subjects | Best practice; Computer Science - Learning; Design of experiments; Machine learning; Quantitative Biology - Genomics; Reproducibility; Supervised learning |
title | Supervised machine learning for microbiomics: bridging the gap between current and best practices |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T10%3A41%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Supervised%20machine%20learning%20for%20microbiomics:%20bridging%20the%20gap%20between%20current%20and%20best%20practices&rft.jtitle=arXiv.org&rft.au=Dudek,%20Natasha%20K&rft.date=2024-11-17&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2402.17621&rft_dat=%3Cproquest_arxiv%3E2932603319%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2932603319&rft_id=info:pmid/&rfr_iscdi=true |
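
The description above flags test set omission and data leakage as common problems. The following is a minimal sketch of what avoiding one form of leakage can look like, using only the Python standard library. It is not the authors' tutorial code: the function names (`train_test_split`, `fit_scaler`, `apply_scaler`) and the abundance values are hypothetical, chosen for illustration. The idea shown is that a held-out test set is set aside first, and preprocessing parameters (here, a mean and standard deviation for standardization) are estimated from the training split only, then reused unchanged on the test split.

```python
import random
import statistics


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices once and hold out a test set that is never used for fitting."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [samples[i] for i in train_idx], [samples[i] for i in test_idx]


def fit_scaler(train_values):
    """Estimate standardization parameters from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against a constant feature
    return mean, std


def apply_scaler(values, mean, std):
    """Standardize values with parameters that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical relative abundances of one taxon across 10 samples (made-up numbers).
abundances = [0.02, 0.10, 0.05, 0.30, 0.07, 0.12, 0.01, 0.25, 0.09, 0.04]

# 1. Split first, so the test samples cannot influence any fitted quantity.
train, test = train_test_split(abundances, test_fraction=0.2, seed=42)

# 2. Fit preprocessing on the training split only.
mean, std = fit_scaler(train)

# 3. Apply the same fitted parameters to both splits.
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

Computing `mean` and `std` over all ten samples before splitting would be a small instance of the leakage the description refers to, because the test samples would then have shaped the features the model is trained on.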