Stratified random sampling from streaming and stored data
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have,...
Published in: | Distributed and parallel databases : an international journal 2021-09, Vol.39 (3), p.665-710 |
---|---|
Main authors: | Nguyen, Trong Duc ; Shih, Ming-Hung ; Srivastava, Divesh ; Tirthapura, Srikanta ; Xu, Bojian |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 710 |
---|---|
container_issue | 3 |
container_start_page | 665 |
container_title | Distributed and parallel databases : an international journal |
container_volume | 39 |
creator | Nguyen, Trong Duc ; Shih, Ming-Hung ; Srivastava, Divesh ; Tirthapura, Srikanta ; Xu, Bojian |
description | Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of Ω(rM log W) in the worst case, to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Due to the inherent high workspace needs for sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O(M) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to their optimal offline counterparts, which was given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data. |
doi_str_mv | 10.1007/s10619-020-07315-w |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2572250538</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2572250538</sourcerecordid><originalsourceid>FETCH-LOGICAL-c319t-7b7d6b4645655a2e2a0f6d71804b5750c2e381f800078d42a867e4186a157903</originalsourceid><addsrcrecordid>eNp9kM1LxDAQxYMoWFf_AU8Fz9FJ0mTSoyx-wYIH9x7SNl26bD9Msiz-96ZW8OZpeMzvzTweIbcM7hkAPgQGipUUOFBAwSQ9nZGMSRQUJepzkkHJFdWo-SW5CmEPACUyzEj5Eb2NXdu5Jvd2aMY-D7afDt2wy1s_q-id7WeZtkmNPpGNjfaaXLT2ENzN71yR7fPTdv1KN-8vb-vHDa0FKyPFChtVFaqQSkrLHbfQqgaZhqKSKKHmTmjW6pQIdVNwqxW6gmllU_wSxIrcLWcnP34eXYhmPx79kD4aLpFzCVLoRPGFqv0YgnetmXzXW_9lGJi5IbM0ZFJD5qchc0omsZhCgoed83-n_3F9A6d5Z5s</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2572250538</pqid></control><display><type>article</type><title>Stratified random sampling from streaming and stored data</title><source>SpringerLink Journals - AutoHoldings</source><creator>Nguyen, Trong Duc ; Shih, Ming-Hung ; Srivastava, Divesh ; Tirthapura, Srikanta ; Xu, Bojian</creator><creatorcontrib>Nguyen, Trong Duc ; Shih, Ming-Hung ; Srivastava, Divesh ; Tirthapura, Srikanta ; Xu, Bojian</creatorcontrib><description>Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is
Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of Ω(rM log W) in the worst case, to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Due to the inherent high workspace needs for sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O(M) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to their optimal offline counterparts, which was given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation
, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.</description><identifier>ISSN: 0926-8782</identifier><identifier>EISSN: 1573-7578</identifier><identifier>DOI: 10.1007/s10619-020-07315-w</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Algorithms ; Computer Science ; Data Structures ; Data transmission ; Database Management ; Information Systems Applications (incl.Internet) ; Lower bounds ; Memory Structures ; Multilayers ; Operating Systems ; Query processing ; Random sampling ; Sampling methods ; Sliding ; Variance</subject><ispartof>Distributed and parallel databases : an international journal, 2021-09, Vol.39 (3), p.665-710</ispartof><rights>Springer Science+Business Media, LLC, part of Springer Nature 2020</rights><rights>Springer Science+Business Media, LLC, part of Springer Nature 2020.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c319t-7b7d6b4645655a2e2a0f6d71804b5750c2e381f800078d42a867e4186a157903</citedby><cites>FETCH-LOGICAL-c319t-7b7d6b4645655a2e2a0f6d71804b5750c2e381f800078d42a867e4186a157903</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10619-020-07315-w$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10619-020-07315-w$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,776,780,27901,27902,41464,42533,51294</link.rule.ids></links><search><creatorcontrib>Nguyen, Trong Duc</creatorcontrib><creatorcontrib>Shih, Ming-Hung</creatorcontrib><creatorcontrib>Srivastava, Divesh</creatorcontrib><creatorcontrib>Tirthapura, 
Srikanta</creatorcontrib><creatorcontrib>Xu, Bojian</creatorcontrib><title>Stratified random sampling from streaming and stored data</title><title>Distributed and parallel databases : an international journal</title><addtitle>Distrib Parallel Databases</addtitle><description>Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is
Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of Ω(rM log W) in the worst case, to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Due to the inherent high workspace needs for sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O(M) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to their optimal offline counterparts, which was given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation
, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.</description><subject>Algorithms</subject><subject>Computer Science</subject><subject>Data Structures</subject><subject>Data transmission</subject><subject>Database Management</subject><subject>Information Systems Applications (incl.Internet)</subject><subject>Lower bounds</subject><subject>Memory Structures</subject><subject>Multilayers</subject><subject>Operating Systems</subject><subject>Query processing</subject><subject>Random sampling</subject><subject>Sampling methods</subject><subject>Sliding</subject><subject>Variance</subject><issn>0926-8782</issn><issn>1573-7578</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp9kM1LxDAQxYMoWFf_AU8Fz9FJ0mTSoyx-wYIH9x7SNl26bD9Msiz-96ZW8OZpeMzvzTweIbcM7hkAPgQGipUUOFBAwSQ9nZGMSRQUJepzkkHJFdWo-SW5CmEPACUyzEj5Eb2NXdu5Jvd2aMY-D7afDt2wy1s_q-id7WeZtkmNPpGNjfaaXLT2ENzN71yR7fPTdv1KN-8vb-vHDa0FKyPFChtVFaqQSkrLHbfQqgaZhqKSKKHmTmjW6pQIdVNwqxW6gmllU_wSxIrcLWcnP34eXYhmPx79kD4aLpFzCVLoRPGFqv0YgnetmXzXW_9lGJi5IbM0ZFJD5qchc0omsZhCgoed83-n_3F9A6d5Z5s</recordid><startdate>20210901</startdate><enddate>20210901</enddate><creator>Nguyen, Trong Duc</creator><creator>Shih, Ming-Hung</creator><creator>Srivastava, Divesh</creator><creator>Tirthapura, Srikanta</creator><creator>Xu, Bojian</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20210901</creationdate><title>Stratified random sampling from streaming and stored data</title><author>Nguyen, Trong Duc ; Shih, Ming-Hung ; Srivastava, Divesh ; Tirthapura, Srikanta ; Xu, 
Bojian</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c319t-7b7d6b4645655a2e2a0f6d71804b5750c2e381f800078d42a867e4186a157903</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Computer Science</topic><topic>Data Structures</topic><topic>Data transmission</topic><topic>Database Management</topic><topic>Information Systems Applications (incl.Internet)</topic><topic>Lower bounds</topic><topic>Memory Structures</topic><topic>Multilayers</topic><topic>Operating Systems</topic><topic>Query processing</topic><topic>Random sampling</topic><topic>Sampling methods</topic><topic>Sliding</topic><topic>Variance</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Nguyen, Trong Duc</creatorcontrib><creatorcontrib>Shih, Ming-Hung</creatorcontrib><creatorcontrib>Srivastava, Divesh</creatorcontrib><creatorcontrib>Tirthapura, Srikanta</creatorcontrib><creatorcontrib>Xu, Bojian</creatorcontrib><collection>CrossRef</collection><jtitle>Distributed and parallel databases : an international journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Nguyen, Trong Duc</au><au>Shih, Ming-Hung</au><au>Srivastava, Divesh</au><au>Tirthapura, Srikanta</au><au>Xu, Bojian</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Stratified random sampling from streaming and stored data</atitle><jtitle>Distributed and parallel databases : an international journal</jtitle><stitle>Distrib Parallel Databases</stitle><date>2021-09-01</date><risdate>2021</risdate><volume>39</volume><issue>3</issue><spage>665</spage><epage>710</epage><pages>665-710</pages><issn>0926-8782</issn><eissn>1573-7578</eissn><abstract>Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. 
We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is
Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of Ω(rM log W) in the worst case, to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Due to the inherent high workspace needs for sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O(M) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to their optimal offline counterparts, which was given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation
, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10619-020-07315-w</doi><tpages>46</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0926-8782 |
ispartof | Distributed and parallel databases : an international journal, 2021-09, Vol.39 (3), p.665-710 |
issn | 0926-8782 ; 1573-7578 |
language | eng |
recordid | cdi_proquest_journals_2572250538 |
source | SpringerLink Journals - AutoHoldings |
subjects | Algorithms ; Computer Science ; Data Structures ; Data transmission ; Database Management ; Information Systems Applications (incl.Internet) ; Lower bounds ; Memory Structures ; Multilayers ; Operating Systems ; Query processing ; Random sampling ; Sampling methods ; Sliding ; Variance |
title | Stratified random sampling from streaming and stored data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T22%3A00%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stratified%20random%20sampling%20from%20streaming%20and%20stored%20data&rft.jtitle=Distributed%20and%20parallel%20databases%20:%20an%20international%20journal&rft.au=Nguyen,%20Trong%20Duc&rft.date=2021-09-01&rft.volume=39&rft.issue=3&rft.spage=665&rft.epage=710&rft.pages=665-710&rft.issn=0926-8782&rft.eissn=1573-7578&rft_id=info:doi/10.1007/s10619-020-07315-w&rft_dat=%3Cproquest_cross%3E2572250538%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2572250538&rft_id=info:pmid/&rfr_iscdi=true |
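
The abstract notes that Neyman allocation is optimal only when every stratum is abundant. The sketch below illustrates that remark under stated assumptions; it is not the paper's VOILA algorithm. Neyman allocation gives stratum i a sample size proportional to n_i * sigma_i, which can exceed the stratum's actual size n_i. The function `capped_allocation` is a hypothetical, commonly used repair: any stratum asked for more than it holds is sampled in full and the remaining budget is re-allocated among the rest. Sample sizes are left real-valued (no rounding), and `stratified_variance` assumes the textbook variance of the stratified mean estimate with per-stratum standard deviations computed under the n_i - 1 convention. All function and parameter names are illustrative.

```python
from typing import List


def neyman_allocation(sizes: List[int], stds: List[float], budget: float) -> List[float]:
    """Classical Neyman allocation: s_i proportional to n_i * sigma_i (real-valued)."""
    weights = [n * sd for n, sd in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No variance in any stratum: fall back to proportional allocation.
        return [budget * n / sum(sizes) for n in sizes]
    return [budget * w / total for w in weights]


def capped_allocation(sizes: List[int], stds: List[float], budget: float) -> List[float]:
    """Hypothetical repair: never request more than n_i elements from stratum i."""
    alloc = [0.0] * len(sizes)
    active = list(range(len(sizes)))
    remaining = float(budget)
    while active:
        share = neyman_allocation(
            [sizes[i] for i in active], [stds[i] for i in active], remaining
        )
        over = [i for i, s in zip(active, share) if s > sizes[i]]
        if not over:
            for i, s in zip(active, share):
                alloc[i] = s
            break
        # Sample the over-requested strata in full, then re-allocate the rest.
        for i in over:
            alloc[i] = float(sizes[i])
            remaining -= sizes[i]
        active = [i for i in active if i not in over]
    return alloc


def stratified_variance(sizes: List[int], stds: List[float], alloc: List[float]) -> float:
    """Variance of the stratified mean estimate: (1/n^2) * sum n_i (n_i - s_i) sigma_i^2 / s_i.

    Strata with s_i == 0 are skipped, so this assumes every stratum with
    nonzero variance receives at least some sample.
    """
    n = sum(sizes)
    return sum(
        ni * (ni - si) * sd ** 2 / si
        for ni, sd, si in zip(sizes, stds, alloc)
        if si > 0
    ) / n ** 2


if __name__ == "__main__":
    sizes, stds, budget = [1000, 10], [1.0, 500.0], 100
    print(neyman_allocation(sizes, stds, budget))  # asks the 10-element stratum for ~83 samples
    print(capped_allocation(sizes, stds, budget))  # -> [90.0, 10.0]
```

In this toy input the small, high-variance stratum cannot supply the roughly 83 samples that Neyman allocation requests. Simply truncating its share to 10 would leave most of the budget unused, whereas re-allocating gives the large stratum 90 samples and a lower variance.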