Stratified random sampling from streaming and stored data
Published in: | Distributed and parallel databases : an international journal 2021-09, Vol.39 (3), p.665-710 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is a factor of $\Omega(r)$ away from the optimal, where $r$ is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of $\Omega(rM \log W)$ in the worst case to maintain a variance-optimal SRS of size $M$, where $W$ is the number of elements in the sliding window. Due to the inherently high workspace needs of sliding window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only $O(M)$ workspace but can maintain an SRS of size close to $M$ in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to that of their optimal offline counterparts, which are given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data. |
---|---|
ISSN: | 0926-8782 1573-7578 |
DOI: | 10.1007/s10619-020-07315-w |
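
The abstract's remark that Neyman allocation is optimal only when each stratum is abundant can be illustrated with a small sketch. The code below is not taken from the paper and does not implement VOILA. It only applies the textbook Neyman rule, which gives stratum $i$ a share of the sample budget $M$ proportional to $n_i \sigma_i$ (stratum size times standard deviation). The function names `neyman_allocation` and `overallocated` and the example numbers are assumptions made for illustration.

```python
def neyman_allocation(sizes, stds, budget):
    """Textbook Neyman allocation (real-valued, no rounding).

    sizes[i]  -- number of elements n_i in stratum i
    stds[i]   -- standard deviation sigma_i of stratum i
    budget    -- total sample size M
    Returns the list of per-stratum sample sizes M * n_i*sigma_i / sum_j n_j*sigma_j.
    """
    weights = [n * sd for n, sd in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # All strata have zero spread: fall back to an equal split (illustrative choice).
        return [budget / len(sizes)] * len(sizes)
    return [budget * w / total for w in weights]


def overallocated(sizes, alloc):
    """Indices of strata that are asked for more samples than they contain."""
    return [i for i, (n, s) in enumerate(zip(sizes, alloc)) if s > n]


# Hypothetical data: a small, highly variable stratum next to a large, uniform one.
sizes = [10, 1000]
stds = [50.0, 1.0]
alloc = neyman_allocation(sizes, stds, budget=60)
print(alloc)                        # [20.0, 40.0]
print(overallocated(sizes, alloc))  # [0]
```

In this made-up example the first stratum holds only 10 elements but is assigned 20 samples, so the allocation cannot be realized as stated. This is the kind of non-abundant stratum the abstract refers to; how VOILA redistributes the budget in such cases is described in the paper and is not reproduced here.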