Validity problems in clinical machine learning by indirect data labeling using consensus definitions

We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included i...

Bibliographic details
Main authors:	Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan
Format:	Article
Language:	eng
Subjects:	Computer Science - Learning ; Quantitative Biology - Quantitative Methods ; Statistics - Applications ; Statistics - Machine Learning
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan
description	We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.
doi_str_mv	10.48550/arxiv.2311.03037
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2311_03037</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2311_03037</sourcerecordid><originalsourceid>FETCH-LOGICAL-a677-fdb1b9ab891472c611c0823f997b8642a8f1b12eaccbba5eff49d7fd413affb93</originalsourceid><addsrcrecordid>eNotj8tqwzAURLXpoqT9gK6qH7BrWbZlLUvoCwLdhG7NvXokF2QlSE6p_7522s0MzIGBw9iDqMqmb9vqCdIPfZe1FKKsZCXVLbNfEMjSNPNzOmFwY-YUuQkUyUDgI5gjRceDgxQpHjjOC7eUnJm4hQl4AHRhJZe8pjnF7GK-ZG6dX04mWoY7duMhZHf_3xu2f33Zb9-L3efbx_Z5V0CnVOEtCtSAvRaNqk0nhKn6WnqtFfZdU0PvBYragTGI0DrvG22Vt42Q4D1quWGPf7dXzeGcaIQ0D6vucNWVvz79Uwk</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><source>arXiv.org</source><creator>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</creator><creatorcontrib>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</creatorcontrib><description>We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. 
We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.</description><identifier>DOI: 10.48550/arxiv.2311.03037</identifier><language>eng</language><subject>Computer Science - Learning ; Quantitative Biology - Quantitative Methods ; Statistics - Applications ; Statistics - Machine Learning</subject><creationdate>2023-11</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2311.03037$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2311.03037$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Hagmann, Michael</creatorcontrib><creatorcontrib>Schamoni, Shigehiko</creatorcontrib><creatorcontrib>Riezler, Stefan</creatorcontrib><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><description>We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. 
Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.</description><subject>Computer Science - Learning</subject><subject>Quantitative Biology - Quantitative Methods</subject><subject>Statistics - Applications</subject><subject>Statistics - Machine Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8tqwzAURLXpoqT9gK6qH7BrWbZlLUvoCwLdhG7NvXokF2QlSE6p_7522s0MzIGBw9iDqMqmb9vqCdIPfZe1FKKsZCXVLbNfEMjSNPNzOmFwY-YUuQkUyUDgI5gjRceDgxQpHjjOC7eUnJm4hQl4AHRhJZe8pjnF7GK-ZG6dX04mWoY7duMhZHf_3xu2f33Zb9-L3efbx_Z5V0CnVOEtCtSAvRaNqk0nhKn6WnqtFfZdU0PvBYragTGI0DrvG22Vt42Q4D1quWGPf7dXzeGcaIQ0D6vucNWVvz79Uwk</recordid><startdate>20231106</startdate><enddate>20231106</enddate><creator>Hagmann, Michael</creator><creator>Schamoni, Shigehiko</creator><creator>Riezler, Stefan</creator><scope>AKY</scope><scope>ALC</scope><scope>EPD</scope><scope>GOX</scope></search><sort><creationdate>20231106</creationdate><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><author>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a677-fdb1b9ab891472c611c0823f997b8642a8f1b12eaccbba5eff49d7fd413affb93</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Learning</topic><topic>Quantitative Biology - Quantitative Methods</topic><topic>Statistics - 
Applications</topic><topic>Statistics - Machine Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Hagmann, Michael</creatorcontrib><creatorcontrib>Schamoni, Shigehiko</creatorcontrib><creatorcontrib>Riezler, Stefan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv Statistics</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Hagmann, Michael</au><au>Schamoni, Shigehiko</au><au>Riezler, Stefan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</atitle><date>2023-11-06</date><risdate>2023</risdate><abstract>We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.</abstract><doi>10.48550/arxiv.2311.03037</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2311.03037
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2311_03037
source	arXiv.org
subjects	Computer Science - Learning ; Quantitative Biology - Quantitative Methods ; Statistics - Applications ; Statistics - Machine Learning
title	Validity problems in clinical machine learning by indirect data labeling using consensus definitions
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T13%3A38%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Validity%20problems%20in%20clinical%20machine%20learning%20by%20indirect%20data%20labeling%20using%20consensus%20definitions&rft.au=Hagmann,%20Michael&rft.date=2023-11-06&rft_id=info:doi/10.48550/arxiv.2311.03037&rft_dat=%3Carxiv_GOX%3E2311_03037%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
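
Illustration (not part of the catalog record): the failure mode summarized in the description field can be sketched with a toy example. Everything here is an assumption made for illustration: the three component scores, the rule "sum of scores >= 6" standing in for a consensus definition, the 1-nearest-neighbor model, and the zero imputation of a missing score. It is not the authors' detection procedure, and the rule is not a clinical definition.

```python
from itertools import product

# Hypothetical consensus definition (an assumption, not a clinical rule):
# a case is positive when three component scores sum to at least 6.
def consensus_label(a, b, c):
    return int(a + b + c >= 6)

def nearest_neighbor_predict(train, x):
    # Return the label of the training row closest to x (squared Euclidean distance).
    _, label = min(
        train,
        key=lambda row: sum((u - v) ** 2 for u, v in zip(row[0], x)),
    )
    return label

def accuracy(train, inputs, truths):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in zip(inputs, truths))
    return hits / len(inputs)

# Training data: every combination of three scores in 0..4, labeled by the definition.
# The defining measurements are themselves the input features.
grid = list(product(range(5), repeat=3))
train = [(x, consensus_label(*x)) for x in grid]
truths = [consensus_label(*x) for x in grid]

# Similarly constructed test data: all defining measurements are present.
acc_full = accuracy(train, grid, truths)

# "Real-world" inputs: the third score is unavailable and imputed as 0,
# while the true condition of each case is unchanged.
masked = [(a, b, 0) for a, b, _ in grid]
acc_masked = accuracy(train, masked, truths)

print(f"all measurements available: {acc_full:.3f}")
print(f"one measurement missing:    {acc_masked:.3f}")
```

Because the label is a deterministic function of the inputs, the model only has to reproduce that function to score perfectly on test data built the same way; once one defining measurement is missing, its accuracy drops below that level.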