NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis

This paper presents a systematic method for creating patient records grounded in medical knowledge for use in differential diagnosis tasks. It also evaluates machine learning models that distinguish between conditions based on the given symptoms. We...

Detailed description

Bibliographic details
Main authors: Al-Ars, Zaid, Agba, Obinna, Guo, Zhuoran, Boerkamp, Christiaan, Jaber, Ziyaad, Jaber, Tareq
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Al-Ars, Zaid
Agba, Obinna
Guo, Zhuoran
Boerkamp, Christiaan
Jaber, Ziyaad
Jaber, Tareq
description This paper presents a systematic method for creating patient records grounded in medical knowledge for use in differential diagnosis tasks. It also evaluates machine learning models that distinguish between conditions based on the given symptoms. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patient records. To make the synthetic data more expressive, we use a medically standardized symptom modeling method called NLICE to augment it with additional contextual information for each condition. Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to construct SymCat-based and NLICE-based datasets and reports how effective these datasets are for training predictive disease models. On the SymCat-based dataset, the Naive Bayes and Random Forest models reach Top-1 accuracy scores of 58.8% and 57.1%, respectively. The NLICE-based dataset improves on these results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values above 90% for both models. Our proposed data generation approach helps overcome a major barrier to applying artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. The NLICE code is open sourced at https://github.com/guozhuoran918/NLICE.
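The Top-1 and Top-5 figures quoted in the description are Top-k accuracy scores: a record counts as correct when the true condition is among the k conditions the model ranks highest. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of records whose true condition is among the k highest-scored conditions.

    scores[i][j] is the model's score for condition j on record i
    (hypothetical layout, assumed for this sketch).
    labels[i] is the index of the true condition for record i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, true_idx in zip(scores, labels):
        # Rank condition indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy data: 4 records scored over 3 conditions.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
toy_labels = [0, 2, 0, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}")
```

In this toy data the last two records are missed at k=1 but recovered at k=2, which illustrates why a Top-5 score can be much higher than a Top-1 score for the same model.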
doi_str_mv 10.48550/arxiv.2401.13756
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2401_13756</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2401_13756</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-48cf5ad5371bfb8f32d02cc8ec625d7a14f90a54edcf328a426ffc0785ef6a9a3</originalsourceid><addsrcrecordid>eNotj8tOwzAURL1hgQofwAr_QIKT2I7LDqWhrRQegu6jm-traikkyLEq-veEtqtZHM1oDmN3mUilUUo8QPj1hzSXIkuzolT6mnWvzbaqH_nncYh7ih75C1mP0PMPwjFYvqaBAkQ_DtyNgdfOEUZ_IP4e_DeEI98Q9HGPEIiv_EwDDdHP_ZWHr2Gc_HTDrhz0E91ecsF2z_Wu2iTN23pbPTUJ6FIn0qBTYFVRZp3rjCtyK3JEQ6hzZUvIpFsKUJIszsyAzLVzKEqjyGlYQrFg9-fZk2T7c77X_su2J9niD1vUULI</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis</title><source>arXiv.org</source><creator>Al-Ars, Zaid ; Agba, Obinna ; Guo, Zhuoran ; Boerkamp, Christiaan ; Jaber, Ziyaad ; Jaber, Tareq</creator><creatorcontrib>Al-Ars, Zaid ; Agba, Obinna ; Guo, Zhuoran ; Boerkamp, Christiaan ; Jaber, Ziyaad ; Jaber, Tareq</creatorcontrib><description>This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is also provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patients records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. 
The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train a Naive Bayes and Random Forest model yielding a 58.8% and 57.1% Top-1 accuracy score, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach solves a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. The NLICE code is open sourced at https://github.com/guozhuoran918/NLICE.</description><identifier>DOI: 10.48550/arxiv.2401.13756</identifier><language>eng</language><subject>Computer Science - Learning</subject><creationdate>2024-01</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,781,886</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2401.13756$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2401.13756$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Al-Ars, Zaid</creatorcontrib><creatorcontrib>Agba, Obinna</creatorcontrib><creatorcontrib>Guo, Zhuoran</creatorcontrib><creatorcontrib>Boerkamp, Christiaan</creatorcontrib><creatorcontrib>Jaber, Ziyaad</creatorcontrib><creatorcontrib>Jaber, Tareq</creatorcontrib><title>NLICE: Synthetic Medical Record Generation for Effective Primary 
Healthcare Differential Diagnosis</title><description>This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is also provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patients records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train a Naive Bayes and Random Forest model yielding a 58.8% and 57.1% Top-1 accuracy score, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach solves a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. 
The NLICE code is open sourced at https://github.com/guozhuoran918/NLICE.</description><subject>Computer Science - Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8tOwzAURL1hgQofwAr_QIKT2I7LDqWhrRQegu6jm-traikkyLEq-veEtqtZHM1oDmN3mUilUUo8QPj1hzSXIkuzolT6mnWvzbaqH_nncYh7ih75C1mP0PMPwjFYvqaBAkQ_DtyNgdfOEUZ_IP4e_DeEI98Q9HGPEIiv_EwDDdHP_ZWHr2Gc_HTDrhz0E91ecsF2z_Wu2iTN23pbPTUJ6FIn0qBTYFVRZp3rjCtyK3JEQ6hzZUvIpFsKUJIszsyAzLVzKEqjyGlYQrFg9-fZk2T7c77X_su2J9niD1vUULI</recordid><startdate>20240124</startdate><enddate>20240124</enddate><creator>Al-Ars, Zaid</creator><creator>Agba, Obinna</creator><creator>Guo, Zhuoran</creator><creator>Boerkamp, Christiaan</creator><creator>Jaber, Ziyaad</creator><creator>Jaber, Tareq</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240124</creationdate><title>NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis</title><author>Al-Ars, Zaid ; Agba, Obinna ; Guo, Zhuoran ; Boerkamp, Christiaan ; Jaber, Ziyaad ; Jaber, Tareq</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-48cf5ad5371bfb8f32d02cc8ec625d7a14f90a54edcf328a426ffc0785ef6a9a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Al-Ars, Zaid</creatorcontrib><creatorcontrib>Agba, Obinna</creatorcontrib><creatorcontrib>Guo, Zhuoran</creatorcontrib><creatorcontrib>Boerkamp, Christiaan</creatorcontrib><creatorcontrib>Jaber, Ziyaad</creatorcontrib><creatorcontrib>Jaber, Tareq</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Al-Ars, Zaid</au><au>Agba, Obinna</au><au>Guo, Zhuoran</au><au>Boerkamp, Christiaan</au><au>Jaber, Ziyaad</au><au>Jaber, Tareq</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis</atitle><date>2024-01-24</date><risdate>2024</risdate><abstract>This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is also provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patients records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train a Naive Bayes and Random Forest model yielding a 58.8% and 57.1% Top-1 accuracy score, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach solves a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. 
The NLICE code is open sourced at https://github.com/guozhuoran918/NLICE.</abstract><doi>10.48550/arxiv.2401.13756</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2401.13756
ispartof
issn
language eng
recordid cdi_arxiv_primary_2401_13756
source arXiv.org
subjects Computer Science - Learning
title NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T08%3A20%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NLICE:%20Synthetic%20Medical%20Record%20Generation%20for%20Effective%20Primary%20Healthcare%20Differential%20Diagnosis&rft.au=Al-Ars,%20Zaid&rft.date=2024-01-24&rft_id=info:doi/10.48550/arxiv.2401.13756&rft_dat=%3Carxiv_GOX%3E2401_13756%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true