Guided Discrete Diffusion for Electronic Health Record Generation

Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confiden...

Bibliographic details
Published in: arXiv.org 2024-06
Main authors: Han, Jun; Chen, Zixiang; Li, Yongqian; Kou, Yiwen; Halperin, Eran; Tillman, Robert E; Gu, Quanquan
Format: Article
Language: eng
Subjects: Data augmentation; Electronic health records; Medical research
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Han, Jun
Chen, Zixiang
Li, Yongqian
Kou, Yiwen
Halperin, Eran
Tillman, Robert E
Gu, Quanquan
description Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3041589649</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3041589649</sourcerecordid><originalsourceid>FETCH-proquest_journals_30415896493</originalsourceid><addsrcrecordid>eNqNikELgjAYQEcQJOV_GHQW5qamxyjTc3SXMb_hRLb6tv3_dugHdHoP3tuRjAtRFm3F-YHk3q-MMd5ceF2LjFyHaGaY6d14hRAgidbRG2epdkj7DVRAZ42iI8gtLPQJyuFMB7CAMqTvRPZabh7yH4_k_Ohft7F4o_tE8GFaXUSb0iRYVdZt11Sd-O_6AllvOUQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3041589649</pqid></control><display><type>article</type><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><source>Free E- Journals</source><creator>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</creator><creatorcontrib>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</creatorcontrib><description>Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. 
To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Data augmentation ; Electronic health records ; Medical research</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Han, Jun</creatorcontrib><creatorcontrib>Chen, Zixiang</creatorcontrib><creatorcontrib>Li, Yongqian</creatorcontrib><creatorcontrib>Kou, Yiwen</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Tillman, Robert E</creatorcontrib><creatorcontrib>Gu, Quanquan</creatorcontrib><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><title>arXiv.org</title><description>Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. 
Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. 
Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</description><subject>Data augmentation</subject><subject>Electronic health records</subject><subject>Medical research</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNikELgjAYQEcQJOV_GHQW5qamxyjTc3SXMb_hRLb6tv3_dugHdHoP3tuRjAtRFm3F-YHk3q-MMd5ceF2LjFyHaGaY6d14hRAgidbRG2epdkj7DVRAZ42iI8gtLPQJyuFMB7CAMqTvRPZabh7yH4_k_Ohft7F4o_tE8GFaXUSb0iRYVdZt11Sd-O_6AllvOUQ</recordid><startdate>20240614</startdate><enddate>20240614</enddate><creator>Han, Jun</creator><creator>Chen, Zixiang</creator><creator>Li, Yongqian</creator><creator>Kou, Yiwen</creator><creator>Halperin, Eran</creator><creator>Tillman, Robert E</creator><creator>Gu, Quanquan</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240614</creationdate><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><author>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30415896493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Data 
augmentation</topic><topic>Electronic health records</topic><topic>Medical research</topic><toplevel>online_resources</toplevel><creatorcontrib>Han, Jun</creatorcontrib><creatorcontrib>Chen, Zixiang</creatorcontrib><creatorcontrib>Li, Yongqian</creatorcontrib><creatorcontrib>Kou, Yiwen</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Tillman, Robert E</creatorcontrib><creatorcontrib>Gu, Quanquan</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Han, Jun</au><au>Chen, Zixiang</au><au>Li, Yongqian</au><au>Kou, Yiwen</au><au>Halperin, Eran</au><au>Tillman, Robert E</au><au>Gu, Quanquan</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Guided Discrete Diffusion for Electronic Health Record Generation</atitle><jtitle>arXiv.org</jtitle><date>2024-06-14</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Electronic health 
records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3041589649
source Free E- Journals
subjects Data augmentation
Electronic health records
Medical research
title Guided Discrete Diffusion for Electronic Health Record Generation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T13%3A53%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Guided%20Discrete%20Diffusion%20for%20Electronic%20Health%20Record%20Generation&rft.jtitle=arXiv.org&rft.au=Han,%20Jun&rft.date=2024-06-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3041589649%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3041589649&rft_id=info:pmid/&rfr_iscdi=true