Validity problems in clinical machine learning by indirect data labeling using consensus definitions
We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included i...
Main authors: | Hagmann, Michael; Schamoni, Shigehiko; Riezler, Stefan |
---|---|
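Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Quantitative Biology - Quantitative Methods; Statistics - Applications; Statistics - Machine Learning |
Online access: | Order full text |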
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Hagmann, Michael; Schamoni, Shigehiko; Riezler, Stefan |
description | We demonstrate a validity problem of machine learning in the vital
application area of disease diagnosis in medicine. It arises when target labels
in training data are determined by an indirect measurement, and the fundamental
measurements needed to determine this indirect measurement are included in the
input data representation. Machine learning models trained on this data will
learn nothing else but to exactly reconstruct the known target definition. Such
models show perfect performance on similarly constructed test data but will
fail catastrophically on real-world examples where the defining fundamental
measurements are not or only incompletely available. We present a general
procedure allowing identification of problematic datasets and black-box machine
learning models trained on them, and exemplify our detection procedure on the
task of early prediction of sepsis. |
doi_str_mv | 10.48550/arxiv.2311.03037 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2311_03037</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2311_03037</sourcerecordid><originalsourceid>FETCH-LOGICAL-a677-fdb1b9ab891472c611c0823f997b8642a8f1b12eaccbba5eff49d7fd413affb93</originalsourceid><addsrcrecordid>eNotj8tqwzAURLXpoqT9gK6qH7BrWbZlLUvoCwLdhG7NvXokF2QlSE6p_7522s0MzIGBw9iDqMqmb9vqCdIPfZe1FKKsZCXVLbNfEMjSNPNzOmFwY-YUuQkUyUDgI5gjRceDgxQpHjjOC7eUnJm4hQl4AHRhJZe8pjnF7GK-ZG6dX04mWoY7duMhZHf_3xu2f33Zb9-L3efbx_Z5V0CnVOEtCtSAvRaNqk0nhKn6WnqtFfZdU0PvBYragTGI0DrvG22Vt42Q4D1quWGPf7dXzeGcaIQ0D6vucNWVvz79Uwk</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><source>arXiv.org</source><creator>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</creator><creatorcontrib>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</creatorcontrib><description>We demonstrate a validity problem of machine learning in the vital
application area of disease diagnosis in medicine. It arises when target labels
in training data are determined by an indirect measurement, and the fundamental
measurements needed to determine this indirect measurement are included in the
input data representation. Machine learning models trained on this data will
learn nothing else but to exactly reconstruct the known target definition. Such
models show perfect performance on similarly constructed test data but will
fail catastrophically on real-world examples where the defining fundamental
measurements are not or only incompletely available. We present a general
procedure allowing identification of problematic datasets and black-box machine
learning models trained on them, and exemplify our detection procedure on the
task of early prediction of sepsis.</description><identifier>DOI: 10.48550/arxiv.2311.03037</identifier><language>eng</language><subject>Computer Science - Learning ; Quantitative Biology - Quantitative Methods ; Statistics - Applications ; Statistics - Machine Learning</subject><creationdate>2023-11</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2311.03037$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2311.03037$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Hagmann, Michael</creatorcontrib><creatorcontrib>Schamoni, Shigehiko</creatorcontrib><creatorcontrib>Riezler, Stefan</creatorcontrib><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><description>We demonstrate a validity problem of machine learning in the vital
application area of disease diagnosis in medicine. It arises when target labels
in training data are determined by an indirect measurement, and the fundamental
measurements needed to determine this indirect measurement are included in the
input data representation. Machine learning models trained on this data will
learn nothing else but to exactly reconstruct the known target definition. Such
models show perfect performance on similarly constructed test data but will
fail catastrophically on real-world examples where the defining fundamental
measurements are not or only incompletely available. We present a general
procedure allowing identification of problematic datasets and black-box machine
learning models trained on them, and exemplify our detection procedure on the
task of early prediction of sepsis.</description><subject>Computer Science - Learning</subject><subject>Quantitative Biology - Quantitative Methods</subject><subject>Statistics - Applications</subject><subject>Statistics - Machine Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8tqwzAURLXpoqT9gK6qH7BrWbZlLUvoCwLdhG7NvXokF2QlSE6p_7522s0MzIGBw9iDqMqmb9vqCdIPfZe1FKKsZCXVLbNfEMjSNPNzOmFwY-YUuQkUyUDgI5gjRceDgxQpHjjOC7eUnJm4hQl4AHRhJZe8pjnF7GK-ZG6dX04mWoY7duMhZHf_3xu2f33Zb9-L3efbx_Z5V0CnVOEtCtSAvRaNqk0nhKn6WnqtFfZdU0PvBYragTGI0DrvG22Vt42Q4D1quWGPf7dXzeGcaIQ0D6vucNWVvz79Uwk</recordid><startdate>20231106</startdate><enddate>20231106</enddate><creator>Hagmann, Michael</creator><creator>Schamoni, Shigehiko</creator><creator>Riezler, Stefan</creator><scope>AKY</scope><scope>ALC</scope><scope>EPD</scope><scope>GOX</scope></search><sort><creationdate>20231106</creationdate><title>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</title><author>Hagmann, Michael ; Schamoni, Shigehiko ; Riezler, Stefan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a677-fdb1b9ab891472c611c0823f997b8642a8f1b12eaccbba5eff49d7fd413affb93</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Learning</topic><topic>Quantitative Biology - Quantitative Methods</topic><topic>Statistics - Applications</topic><topic>Statistics - Machine Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Hagmann, Michael</creatorcontrib><creatorcontrib>Schamoni, Shigehiko</creatorcontrib><creatorcontrib>Riezler, Stefan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv 
Statistics</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Hagmann, Michael</au><au>Schamoni, Shigehiko</au><au>Riezler, Stefan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Validity problems in clinical machine learning by indirect data labeling using consensus definitions</atitle><date>2023-11-06</date><risdate>2023</risdate><abstract>We demonstrate a validity problem of machine learning in the vital
application area of disease diagnosis in medicine. It arises when target labels
in training data are determined by an indirect measurement, and the fundamental
measurements needed to determine this indirect measurement are included in the
input data representation. Machine learning models trained on this data will
learn nothing else but to exactly reconstruct the known target definition. Such
models show perfect performance on similarly constructed test data but will
fail catastrophically on real-world examples where the defining fundamental
measurements are not or only incompletely available. We present a general
procedure allowing identification of problematic datasets and black-box machine
learning models trained on them, and exemplify our detection procedure on the
task of early prediction of sepsis.</abstract><doi>10.48550/arxiv.2311.03037</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2311.03037 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2311_03037 |
source | arXiv.org |
subjects | Computer Science - Learning; Quantitative Biology - Quantitative Methods; Statistics - Applications; Statistics - Machine Learning |
title | Validity problems in clinical machine learning by indirect data labeling using consensus definitions |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T13%3A38%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Validity%20problems%20in%20clinical%20machine%20learning%20by%20indirect%20data%20labeling%20using%20consensus%20definitions&rft.au=Hagmann,%20Michael&rft.date=2023-11-06&rft_id=info:doi/10.48550/arxiv.2311.03037&rft_dat=%3Carxiv_GOX%3E2311_03037%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |