Guided Discrete Diffusion for Electronic Health Record Generation

Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confiden...

Bibliographic details
Main authors: Han, Jun; Chen, Zixiang; Li, Yongqian; Kou, Yiwen; Halperin, Eran; Tillman, Robert E; Gu, Quanquan
Format: Article
Language: eng
Subjects: Computer Science - Learning
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Han, Jun
Chen, Zixiang
Li, Yongqian
Kou, Yiwen
Halperin, Eran
Tillman, Robert E
Gu, Quanquan
description Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite their wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show that EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.
doi_str_mv 10.48550/arxiv.2404.12314
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2404_12314</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2404_12314</sourcerecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Guided Discrete Diffusion for Electronic Health Record Generation</title><source>arXiv.org</source><creator>Han, Jun ; Chen, Zixiang ; Li, Yongqian ; Kou, Yiwen ; Halperin, Eran ; Tillman, Robert E ; Gu, Quanquan</creator><identifier>DOI: 10.48550/arxiv.2404.12314</identifier><language>eng</language><subject>Computer Science - Learning</subject><creationdate>2024-04</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa></display><links><linktorsrc>$$Uhttps://arxiv.org/abs/2404.12314$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2404.12314$$DView paper in arXiv$$Hfree_for_read</backlink></links><addata><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Guided Discrete Diffusion for Electronic Health Record Generation</atitle><date>2024-04-18</date><risdate>2024</risdate><doi>10.48550/arxiv.2404.12314</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2404.12314
ispartof
issn
language eng
recordid cdi_arxiv_primary_2404_12314
source arXiv.org
subjects Computer Science - Learning
title Guided Discrete Diffusion for Electronic Health Record Generation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T14%3A30%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Guided%20Discrete%20Diffusion%20for%20Electronic%20Health%20Record%20Generation&rft.au=Han,%20Jun&rft.date=2024-04-18&rft_id=info:doi/10.48550/arxiv.2404.12314&rft_dat=%3Carxiv_GOX%3E2404_12314%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true