Guided Discrete Diffusion for Electronic Health Record Generation
Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confiden...
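Main authors: | Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning |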
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan |
description | Electronic health records (EHRs) are a pivotal data source that enables
numerous applications in computational medicine, e.g., disease progression
prediction, clinical trial design, and health economics and outcomes research.
Despite their wide usability, their sensitive nature raises privacy and
confidentiality concerns, which limit potential use cases. To tackle these
challenges, we explore the use of generative models to synthesize artificial,
yet realistic EHRs. While diffusion-based methods have recently demonstrated
state-of-the-art performance in generating other data modalities and overcome
the training instability and mode collapse issues that plague previous
GAN-based approaches, their applications in EHR generation remain
underexplored. The discrete nature of tabular medical code data in EHRs poses
challenges for high-quality data generation, especially for continuous
diffusion models. To this end, we introduce a novel tabular EHR generation
method, EHR-D3PM, which enables both unconditional and conditional generation
using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM
significantly outperforms existing generative baselines on comprehensive
fidelity and utility metrics while maintaining lower attribute and membership
vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data
augmentation method and enhances performance on downstream tasks when combined
with real data. |
doi_str_mv | 10.48550/arxiv.2404.12314 |
format | Article |
creationdate | 2024-04-18 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
linktorsrc | https://arxiv.org/abs/2404.12314 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2404.12314 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2404_12314 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Guided Discrete Diffusion for Electronic Health Record Generation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T14%3A30%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Guided%20Discrete%20Diffusion%20for%20Electronic%20Health%20Record%20Generation&rft.au=Han,%20Jun&rft.date=2024-04-18&rft_id=info:doi/10.48550/arxiv.2404.12314&rft_dat=%3Carxiv_GOX%3E2404_12314%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |