STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION
Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one sho...
Published in: | The annals of applied statistics 2018-06, Vol.12 (2), p.685-726 |
---|---|
Main author: | |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 726 |
---|---|
container_issue | 2 |
container_start_page | 685 |
container_title | The annals of applied statistics |
container_volume | 12 |
creator | Meng, Xiao-Li |
description | Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size $n$, probabilistic or not, the difference between the sample average $\bar{X}_n$ and the population average $\bar{X}_N$ is the product of three terms: (1) a data quality measure, $\rho_{R,X}$, the correlation between $X_j$ and the response/recording indicator $R_j$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_X$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{R,X}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; and (III) the “bigness” of such Big Data (for population inferences) should be measured by the relative size $f = n/N$, not the absolute size $n$; (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes. Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{R,X} \approx -0.005$ for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, $n \approx 2{,}300{,}000$, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size $n \approx 400$, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95% confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves. |
doi_str_mv | 10.1214/18-AOAS1161SF |
format | Article |
fullrecord | <record><control><sourceid>jstor_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1214_18_AOAS1161SF</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>26542550</jstor_id><sourcerecordid>26542550</sourcerecordid><originalsourceid>FETCH-LOGICAL-c298t-5ef69abb6c1a5db25171fb962483817c1f72a88fb75ce63ee03a202af44efc403</originalsourceid><addsrcrecordid>eNpFj89LwzAcxYMoODePHoUe9RCXb373GNdtBsYqpoK3ksYEHMqk2cX_fisVPb334MODD0I3QB6AAp-DxqY2DkCCW52hCZQcsGKMnA-dUSxBqEt0lfOOEME1hwlirjGNdY1dmE3xbF5MZd3SFWZbjat-Oy27LR7tuqhMY4o7ez9DF8l_5nj9m1P0ulo2iye8qdfDDw601AcsYpKl7zoZwIv3jgpQkLpSUq6ZBhUgKeq1Tp0SIUoWI2GeEuoT5zEFTtgU4fE39Puc-5ja7_7jy_c_LZB2MG5Bt__GJ_525Hf5sO__YCoFp0IQdgQnuEy3</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION</title><source>JSTOR Mathematics & Statistics</source><source>EZB-FREE-00999 freely available EZB journals</source><source>Project Euclid Complete</source><source>JSTOR</source><creator>Meng, Xiao-Li</creator><creatorcontrib>Meng, Xiao-Li</creatorcontrib><description>Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average X̅n
and the population average $\bar{X}_N$ is the product of three terms: (1) a data quality measure, $\rho_{R,X}$, the correlation between $X_j$ and the response/recording indicator $R_j$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_X$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{R,X}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; and (III) the “bigness” of such Big Data (for population inferences) should be measured by the relative size $f = n/N$, not the absolute size $n$; (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.
Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{R,X}$
≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95% confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.</description><identifier>ISSN: 1932-6157</identifier><identifier>EISSN: 1941-7330</identifier><identifier>DOI: 10.1214/18-AOAS1161SF</identifier><language>eng</language><publisher>Institute of Mathematical Statistics</publisher><subject>SPECIAL SECTION IN MEMORY OF STEPHEN E. 
FIENBERG (1942–2016) AOAS EDITOR-IN-CHIEF 2013–2015</subject><ispartof>The annals of applied statistics, 2018-06, Vol.12 (2), p.685-726</ispartof><rights>Institute of Mathematical Statistics, 2018</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c298t-5ef69abb6c1a5db25171fb962483817c1f72a88fb75ce63ee03a202af44efc403</citedby><cites>FETCH-LOGICAL-c298t-5ef69abb6c1a5db25171fb962483817c1f72a88fb75ce63ee03a202af44efc403</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/26542550$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/26542550$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,27924,27925,58017,58021,58250,58254</link.rule.ids></links><search><creatorcontrib>Meng, Xiao-Li</creatorcontrib><title>STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION</title><title>The annals of applied statistics</title><description>Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average X̅n
and the population average $\bar{X}_N$ is the product of three terms: (1) a data quality measure, $\rho_{R,X}$, the correlation between $X_j$ and the response/recording indicator $R_j$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_X$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{R,X}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; and (III) the “bigness” of such Big Data (for population inferences) should be measured by the relative size $f = n/N$, not the absolute size $n$; (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.
Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{R,X}$
≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95% confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.</description><subject>SPECIAL SECTION IN MEMORY OF STEPHEN E. FIENBERG (1942–2016) AOAS EDITOR-IN-CHIEF 2013–2015</subject><issn>1932-6157</issn><issn>1941-7330</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNpFj89LwzAcxYMoODePHoUe9RCXb373GNdtBsYqpoK3ksYEHMqk2cX_fisVPb334MODD0I3QB6AAp-DxqY2DkCCW52hCZQcsGKMnA-dUSxBqEt0lfOOEME1hwlirjGNdY1dmE3xbF5MZd3SFWZbjat-Oy27LR7tuqhMY4o7ez9DF8l_5nj9m1P0ulo2iye8qdfDDw601AcsYpKl7zoZwIv3jgpQkLpSUq6ZBhUgKeq1Tp0SIUoWI2GeEuoT5zEFTtgU4fE39Puc-5ja7_7jy_c_LZB2MG5Bt__GJ_525Hf5sO__YCoFp0IQdgQnuEy3</recordid><startdate>20180601</startdate><enddate>20180601</enddate><creator>Meng, Xiao-Li</creator><general>Institute of Mathematical Statistics</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20180601</creationdate><title>STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I)</title><author>Meng, 
Xiao-Li</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c298t-5ef69abb6c1a5db25171fb962483817c1f72a88fb75ce63ee03a202af44efc403</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>SPECIAL SECTION IN MEMORY OF STEPHEN E. FIENBERG (1942–2016) AOAS EDITOR-IN-CHIEF 2013–2015</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Meng, Xiao-Li</creatorcontrib><collection>CrossRef</collection><jtitle>The annals of applied statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Meng, Xiao-Li</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION</atitle><jtitle>The annals of applied statistics</jtitle><date>2018-06-01</date><risdate>2018</risdate><volume>12</volume><issue>2</issue><spage>685</spage><epage>726</epage><pages>685-726</pages><issn>1932-6157</issn><eissn>1941-7330</eissn><abstract>Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average X̅n
and the population average $\bar{X}_N$ is the product of three terms: (1) a data quality measure, $\rho_{R,X}$, the correlation between $X_j$ and the response/recording indicator $R_j$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_X$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{R,X}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; and (III) the “bigness” of such Big Data (for population inferences) should be measured by the relative size $f = n/N$, not the absolute size $n$; (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.
Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{R,X}$
≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95% confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.</abstract><pub>Institute of Mathematical Statistics</pub><doi>10.1214/18-AOAS1161SF</doi><tpages>42</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1932-6157 |
ispartof | The annals of applied statistics, 2018-06, Vol.12 (2), p.685-726 |
issn | 1932-6157 1941-7330 |
language | eng |
recordid | cdi_crossref_primary_10_1214_18_AOAS1161SF |
source | JSTOR Mathematics & Statistics; EZB-FREE-00999 freely available EZB journals; Project Euclid Complete; JSTOR |
subjects | SPECIAL SECTION IN MEMORY OF STEPHEN E. FIENBERG (1942–2016) AOAS EDITOR-IN-CHIEF 2013–2015 |
title | STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T13%3A38%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=STATISTICAL%20PARADISES%20AND%20PARADOXES%20IN%20BIG%20DATA%20(I):%20LAW%20OF%20LARGE%20POPULATIONS,%20BIG%20DATA%20PARADOX,%20AND%20THE%202016%20US%20PRESIDENTIAL%20ELECTION&rft.jtitle=The%20annals%20of%20applied%20statistics&rft.au=Meng,%20Xiao-Li&rft.date=2018-06-01&rft.volume=12&rft.issue=2&rft.spage=685&rft.epage=726&rft.pages=685-726&rft.issn=1932-6157&rft.eissn=1941-7330&rft_id=info:doi/10.1214/18-AOAS1161SF&rft_dat=%3Cjstor_cross%3E26542550%3C/jstor_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_jstor_id=26542550&rfr_iscdi=true |
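
The identity and the sample-size comparison in the abstract can be checked numerically. The following is a minimal sketch, not code from the article: the function names `identity_terms` and `effective_sample_size`, the simulated population, and the recording mechanism are all illustrative assumptions. The effective-sample-size formula is likewise an assumption, obtained by equating the squared error implied by the identity, ρ²·(1−f)/f·σ², with the simple-random-sampling variance σ²/n_eff (ignoring the finite population correction).

```python
import math
import random


def identity_terms(x, r):
    """Return (error, rho, quantity, sigma) for values x and 0/1 indicators r.

    error    = sample mean of recorded values minus population mean
    rho      = finite-population correlation between r and x
    quantity = sqrt((N - n) / n)
    sigma    = finite-population standard deviation of x (divisor N)
    """
    N = len(x)
    n = sum(r)
    f = n / N
    mean_N = sum(x) / N
    mean_n = sum(xi for xi, ri in zip(x, r) if ri) / n
    sigma_x = math.sqrt(sum((xi - mean_N) ** 2 for xi in x) / N)
    sigma_r = math.sqrt(f * (1 - f))  # std. dev. of a 0/1 indicator with mean f
    cov_rx = sum((ri - f) * (xi - mean_N) for xi, ri in zip(x, r)) / N
    rho = cov_rx / (sigma_r * sigma_x)
    quantity = math.sqrt((N - n) / n)
    return mean_n - mean_N, rho, quantity, sigma_x


def effective_sample_size(f, rho):
    """Assumed approximation: n_eff = f / ((1 - f) * rho^2)."""
    return f / (1 - f) / rho ** 2


# Hypothetical population with a non-probabilistic recording mechanism:
# units with larger x are more likely to be recorded.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(10_000)]
r = [1 if random.random() < (0.3 if xi > 0 else 0.1) else 0 for xi in x]

error, rho, quantity, sigma = identity_terms(x, r)
print(error, rho * quantity * sigma)  # the two values agree up to rounding

# Figures quoted in the abstract: f = 1%, rho = -0.005, n = 2,300,000.
n_eff = effective_sample_size(0.01, -0.005)
print(n_eff, 1 - n_eff / 2_300_000)
```

Under these assumptions the second calculation gives an effective sample size of roughly 404 and a reduction of about 99.98%, in line with the n ≈ 400 quoted in the abstract.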