Zero Inflation as a Missing Data Problem: a Proxy-based Approach


Bibliographic details
Main authors: Phung, Trung; Lee, Jaron J. R; Oladapo-Shittu, Opeyemi; Klein, Eili Y; Gurses, Ayse Pinar; Hannum, Susan M; Weems, Kimberly; Marsteller, Jill A; Cosgrove, Sara E; Keller, Sara C; Shpitser, Ilya
Format: Article
Language: eng
creator Phung, Trung
Lee, Jaron J. R
Oladapo-Shittu, Opeyemi
Klein, Eili Y
Gurses, Ayse Pinar
Hannum, Susan M
Weems, Kimberly
Marsteller, Jill A
Cosgrove, Sara E
Keller, Sara C
Shpitser, Ilya
description A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data). Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated. This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, provided that the proxy-indicator relationship is known. If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using the methods of Duarte et al. [2023]. We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs).
doi_str_mv 10.48550/arxiv.2406.00549
format Article
creationdate 2024-06-01
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
linktorsrc https://arxiv.org/abs/2406.00549
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2406.00549
language eng
recordid cdi_arxiv_primary_2406_00549
source arXiv.org
subjects Computer Science - Artificial Intelligence
Statistics - Methodology
title Zero Inflation as a Missing Data Problem: a Proxy-based Approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T04%3A45%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Zero%20Inflation%20as%20a%20Missing%20Data%20Problem:%20a%20Proxy-based%20Approach&rft.au=Phung,%20Trung&rft.date=2024-06-01&rft_id=info:doi/10.48550/arxiv.2406.00549&rft_dat=%3Carxiv_GOX%3E2406_00549%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
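
The description above states that, when a proxy of the missingness indicator is observed and the proxy-indicator relationship is known, an effect-restoration-style argument allows identification. The following is only a minimal sketch of the matrix-inversion idea behind that kind of argument, not the authors' estimator. The names R (indicator, with 1 meaning the true value was recorded), W (binary proxy), the function name, and all numbers are illustrative assumptions, as is the assumption that W depends on the data only through R.

```python
import numpy as np


def restore_indicator_joint(p_w_and_zero, p_w_given_r):
    """Recover P(R=r, recorded zero) from P(W=w, recorded zero).

    p_w_and_zero[w]   = P(W=w, recorded value is 0), estimable from observed data.
    p_w_given_r[w, r] = P(W=w | R=r), the proxy-indicator relationship,
                        assumed known (hypothetical input for this sketch).
    Assumes W is independent of the recorded value given R.
    """
    p_w_and_zero = np.asarray(p_w_and_zero, dtype=float)
    p_w_given_r = np.asarray(p_w_given_r, dtype=float)
    # P(W=w, zero) = sum_r P(W=w | R=r) * P(R=r, zero): a linear system in P(R=r, zero).
    return np.linalg.solve(p_w_given_r, p_w_and_zero)


# Synthetic illustration (made-up numbers, not taken from the paper).
# Columns index r = 0 (inflated zero) and r = 1 (true value recorded).
proxy_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
# Hidden truth: P(R=0, zero) = 0.2 inflated zeros, P(R=1, zero) = 0.1 genuine zeros.
true_joint = np.array([0.2, 0.1])
# What an analyst could actually estimate: the proxy's joint with the recorded-zero event.
observed = proxy_matrix @ true_joint

recovered = restore_indicator_joint(observed, proxy_matrix)
print(recovered)
```

In this toy setup the inversion separates inflated zeros from genuine ones only because the proxy matrix is invertible and taken as known; when it is unknown, the description indicates that the approach instead yields bounds for sensitivity analysis.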