Guided Discrete Diffusion for Electronic Health Record Generation
Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confiden...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2024-06 |
---|---|
Hauptverfasser: | , , , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Han, Jun Chen, Zixiang Li, Yongqian Kou, Yiwen Halperin, Eran Tillman, Robert E Gu, Quanquan |
description | Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3041589649</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3041589649</sourcerecordid><originalsourceid>FETCH-proquest_journals_30415896493</originalsourceid><addsrcrecordid>eNqNikELgjAYQEcQJOV_GHQW5qamxyjTc3SXMb_hRLb6tv3_dugHdHoP3tuRjAtRFm3F-YHk3q-MMd5ceF2LjFyHaGaY6d14hRAgidbRG2epdkj7DVRAZ42iI8gtLPQJyuFMB7CAMqTvRPZabh7yH4_k_Ohft7F4o_tE8GFaXUSb0iRYVdZt11Sd-O_6AllvOUQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3041589649</pqid></control><display><type>article</type><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><source>Free E- Journals</source><creator>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</creator><creatorcontrib>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</creatorcontrib><description>Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Data augmentation ; Electronic health records ; Medical research</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Han, Jun</creatorcontrib><creatorcontrib>Chen, Zixiang</creatorcontrib><creatorcontrib>Li, Yongqian</creatorcontrib><creatorcontrib>Kou, Yiwen</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Tillman, Robert E</creatorcontrib><creatorcontrib>Gu, Quanquan</creatorcontrib><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><title>arXiv.org</title><description>Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</description><subject>Data augmentation</subject><subject>Electronic health records</subject><subject>Medical research</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNikELgjAYQEcQJOV_GHQW5qamxyjTc3SXMb_hRLb6tv3_dugHdHoP3tuRjAtRFm3F-YHk3q-MMd5ceF2LjFyHaGaY6d14hRAgidbRG2epdkj7DVRAZ42iI8gtLPQJyuFMB7CAMqTvRPZabh7yH4_k_Ohft7F4o_tE8GFaXUSb0iRYVdZt11Sd-O_6AllvOUQ</recordid><startdate>20240614</startdate><enddate>20240614</enddate><creator>Han, Jun</creator><creator>Chen, Zixiang</creator><creator>Li, Yongqian</creator><creator>Kou, Yiwen</creator><creator>Halperin, Eran</creator><creator>Tillman, Robert E</creator><creator>Gu, Quanquan</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240614</creationdate><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><author>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30415896493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Data augmentation</topic><topic>Electronic health records</topic><topic>Medical research</topic><toplevel>online_resources</toplevel><creatorcontrib>Han, Jun</creatorcontrib><creatorcontrib>Chen, Zixiang</creatorcontrib><creatorcontrib>Li, Yongqian</creatorcontrib><creatorcontrib>Kou, Yiwen</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Tillman, Robert E</creatorcontrib><creatorcontrib>Gu, Quanquan</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Han, Jun</au><au>Chen, Zixiang</au><au>Li, Yongqian</au><au>Kou, Yiwen</au><au>Halperin, Eran</au><au>Tillman, Robert E</au><au>Gu, Quanquan</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Guided Discrete Diffusion for Electronic Health Record Generation</atitle><jtitle>arXiv.org</jtitle><date>2024-06-14</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-06 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3041589649 |
source | Free E- Journals |
subjects | Data augmentation Electronic health records Medical research |
title | Guided Discrete Diffusion for Electronic Health Record Generation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T13%3A53%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Guided%20Discrete%20Diffusion%20for%20Electronic%20Health%20Record%20Generation&rft.jtitle=arXiv.org&rft.au=Han,%20Jun&rft.date=2024-06-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3041589649%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3041589649&rft_id=info:pmid/&rfr_iscdi=true |