Optimal sampling from sliding windows
Published in: | Journal of Computer and System Sciences 2012, Vol. 78 (1), p. 260-272 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | A sliding windows model is an important case of the streaming model, where only the most “recent” elements remain active and the rest are discarded. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of the sliding windows model: windows with fixed size (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and timestamp-based windows (e.g., where many items can arrive in “bursts” at a single step and where only items from the last t steps remain active, again for some fixed parameter t).
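The two window types can be illustrated with a minimal Python sketch. The class and method names (`FixedSizeWindow`, `TimestampWindow`, `add_burst`) are hypothetical and do not come from the paper. Both classes store the entire window, which is exactly what a streaming algorithm tries to avoid, so they serve only to pin down which elements count as active.

```python
from collections import deque


class FixedSizeWindow:
    """Fixed-size window: only the most recent n items are active."""

    def __init__(self, n):
        # A bounded deque discards the oldest item once n items are held.
        self.items = deque(maxlen=n)

    def add(self, item):
        self.items.append(item)

    def active(self):
        return list(self.items)


class TimestampWindow:
    """Timestamp-based window: items from the last t steps are active."""

    def __init__(self, t):
        self.t = t
        self.items = deque()  # (step, item) pairs in arrival order

    def add_burst(self, step, burst):
        # Any number of items (including zero) may arrive at one step.
        for item in burst:
            self.items.append((step, item))
        self._expire(step)

    def _expire(self, now):
        # Steps now - t + 1 .. now are active; older items are dropped.
        while self.items and self.items[0][0] <= now - self.t:
            self.items.popleft()

    def active(self):
        return [item for _, item in self.items]
```

In the fixed-size case the number of active items is always at most n, whereas in the timestamp-based case it depends on the burst sizes and is not known in advance.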
Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a nontrivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly “translated” to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is that of over-sampling. When k samples are required, the over-sampling method maintains k′ > k samples in the hope that at least k samples are not expired. The obvious disadvantages of this method are twofold:
(a) It introduces additional costs and thus decreases the performance; and
(b) The memory bounds are not deterministic, which is atypical for streaming algorithms (where even small probability events may eventually happen for a stream that is long enough).
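Both disadvantages can be seen in a minimal Python sketch of one simple over-sampling scheme for a fixed-size window. This is an illustration under stated assumptions, not necessarily the scheme the authors have in mind: each arriving item is kept independently with probability k′/n, so about k′ kept items are active in expectation. The names `OverSampler`, `k_prime` and `kept` are hypothetical.

```python
import random
from collections import deque


class OverSampler:
    """Over-sampling sketch for a fixed-size window of n items.

    Assumption (not from the paper): each item is kept independently
    with probability k_prime / n.
    """

    def __init__(self, n, k, k_prime, seed=None):
        if not k < k_prime <= n:
            raise ValueError("need k < k_prime <= n")
        self.n = n
        self.k = k
        self.p = k_prime / n
        self.rng = random.Random(seed)
        self.kept = deque()  # (arrival index, item) pairs
        self.count = 0       # number of items seen so far

    def add(self, item):
        self.count += 1
        if self.rng.random() < self.p:
            self.kept.append((self.count, item))
        # Implicit deletion: drop kept items that have left the window.
        while self.kept and self.kept[0][0] <= self.count - self.n:
            self.kept.popleft()

    def sample(self):
        # Fails (returns None) when fewer than k kept items are active.
        if len(self.kept) < self.k:
            return None
        chosen = self.rng.sample(list(self.kept), self.k)
        return [item for _, item in chosen]
```

The size of `kept` is a random quantity, so memory use is bounded only with some probability (disadvantage (b)), and keeping more than k items is pure overhead whenever the sample succeeds (disadvantage (a)). On a long enough stream, `sample()` will eventually return `None`.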
Babcock, Datar and Motwani (SODA 02) were the first to stress the importance of improvements to over-sampling. They formally introduced the problem of sampling from sliding windows and improved the over-sampling method for sampling with replacement. Their elegant solutions for sampling with replacement are optimal in expectation, and thus resolve disadvantage (a) mentioned above. Unfortunately, the randomized bounds do not resolve disadvantage (b) above. Interestingly, all algorithms that employ the ideas of Babcock, Datar and Motwani have the same central problem of having to deal with a ra |
---|---|
ISSN: | 0022-0000 1090-2724 |
DOI: | 10.1016/j.jcss.2011.04.004 |