Random sampling techniques for space efficient online computation of order statistics of large datasets

In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main...

Bibliographic details
Published in:	SIGMOD record 1999-06, Vol.28 (2), p.251-262
Main authors:	Manku, Gurmeet Singh; Rajagopalan, Sridhar; Lindsay, Bruce G.
Format:	Article
Language:	eng
Subjects:	Concurrency; Data management systems; Database management system engines; Database query processing; Database query processing and optimization (theory); Database theory; Discrete mathematics; Graph theory; Information systems; Mathematics of computing; Models of computation; Parallel computing models; Probability and statistics; Theory and algorithms for application domains; Theory of computation
Online access:	Full text

container_end_page	262
container_issue	2
container_start_page	251
container_title	SIGMOD record
container_volume	28
creator	Manku, Gurmeet Singh; Rajagopalan, Sridhar; Lindsay, Bruce G.
description	In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.
doi_str_mv	10.1145/304181.304204
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_29463374</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>29463374</sourcerecordid><originalsourceid>FETCH-LOGICAL-a2384-68197156d3f4baff1e274b067a87459f1c9ccf504fafec3f7a6253b90541fcbf3</originalsourceid><addsrcrecordid>eNo90E1LxDAQBuAgCq6rRw_ecvLWNWmSfhxl8QsWBNFzmWZn1kjb1CR78N_bUvE0w8zDwLyMXUuxkVKbOyW0rORmKrnQJ2wla60yUylzylZCFnMvqnN2EeOXEJMsxIod3mDY-55H6MfODQee0H4O7vuIkZMPPI5gkSORsw6HxP0wKeTW9-MxQXJ-4J64D3uc7DyIydk4zzoIB-R7SBAxxUt2RtBFvPqra_bx-PC-fc52r08v2_tdBrmqdFZUsi6lKfaKdAtEEvNSt6IooSq1qUna2loyQhMQWkUlFLlRbS2MlmRbUmt2u9wdg5-fSE3vosWugwH9MTZ5rQulSj3BbIE2-BgDUjMG10P4aaRo5jibJc5miXPyN4sH2__Tv90v2clxaQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>29463374</pqid></control><display><type>article</type><title>Random sampling techniques for space efficient online computation of order statistics of large datasets</title><source>ACM Digital Library Complete</source><creator>Manku, Gurmeet Singh ; Rajagopalan, Sridhar ; Lindsay, Bruce G.</creator><creatorcontrib>Manku, Gurmeet Singh ; Rajagopalan, Sridhar ; Lindsay, Bruce G.</creatorcontrib><description>In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. 
In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.</description><identifier>ISSN: 0163-5808</identifier><identifier>EISSN: 1943-5835</identifier><identifier>DOI: 10.1145/304181.304204</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Concurrency ; Data management systems ; Database management system engines ; Database query processing ; Database query processing and optimization (theory) ; Database theory ; Discrete mathematics ; Graph theory ; Information systems ; Mathematics of computing ; Models of computation ; Parallel computing models ; Probability and statistics ; Theory and algorithms for application domains ; Theory of computation</subject><ispartof>SIGMOD record, 1999-06, Vol.28 (2), 
p.251-262</ispartof><rights>ACM</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a2384-68197156d3f4baff1e274b067a87459f1c9ccf504fafec3f7a6253b90541fcbf3</citedby><cites>FETCH-LOGICAL-a2384-68197156d3f4baff1e274b067a87459f1c9ccf504fafec3f7a6253b90541fcbf3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/304181.304204$$EPDF$$P50$$Gacm$$H</linktopdf><link.rule.ids>314,776,780,2275,27902,27903,40174,75973</link.rule.ids></links><search><creatorcontrib>Manku, Gurmeet Singh</creatorcontrib><creatorcontrib>Rajagopalan, Sridhar</creatorcontrib><creatorcontrib>Lindsay, Bruce G.</creatorcontrib><title>Random sampling techniques for space efficient online computation of order statistics of large datasets</title><title>SIGMOD record</title><addtitle>ACM SIGMOD</addtitle><description>In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. 
Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.</description><subject>Concurrency</subject><subject>Data management systems</subject><subject>Database management system engines</subject><subject>Database query processing</subject><subject>Database query processing and optimization (theory)</subject><subject>Database theory</subject><subject>Discrete mathematics</subject><subject>Graph theory</subject><subject>Information systems</subject><subject>Mathematics of computing</subject><subject>Models of computation</subject><subject>Parallel computing models</subject><subject>Probability and statistics</subject><subject>Theory and algorithms for application domains</subject><subject>Theory of computation</subject><issn>0163-5808</issn><issn>1943-5835</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>1999</creationdate><recordtype>article</recordtype><recordid>eNo90E1LxDAQBuAgCq6rRw_ecvLWNWmSfhxl8QsWBNFzmWZn1kjb1CR78N_bUvE0w8zDwLyMXUuxkVKbOyW0rORmKrnQJ2wla60yUylzylZCFnMvqnN2EeOXEJMsxIod3mDY-55H6MfODQee0H4O7vuIkZMPPI5gkSORsw6HxP0wKeTW9-MxQXJ-4J64D3uc7DyIydk4zzoIB-R7SBAxxUt2RtBFvPqra_bx-PC-fc52r08v2_tdBrmqdFZUsi6lKfaKdAtEEvNSt6IooSq1qUna2loyQhMQWkUlFLlRbS2MlmRbUmt2u9wdg5-fSE3vosWugwH9MTZ5rQulSj3BbIE2-BgDUjMG10P4aaRo5jibJc5miXPyN4sH2__Tv90v2clxaQ</recordid><startdate>19990601</startdate><enddate>19990601</enddate><creator>Manku, Gurmeet Singh</creator><creator>Rajagopalan, Sridhar</creator><creator>Lindsay, Bruce 
G.</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>19990601</creationdate><title>Random sampling techniques for space efficient online computation of order statistics of large datasets</title><author>Manku, Gurmeet Singh ; Rajagopalan, Sridhar ; Lindsay, Bruce G.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a2384-68197156d3f4baff1e274b067a87459f1c9ccf504fafec3f7a6253b90541fcbf3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>1999</creationdate><topic>Concurrency</topic><topic>Data management systems</topic><topic>Database management system engines</topic><topic>Database query processing</topic><topic>Database query processing and optimization (theory)</topic><topic>Database theory</topic><topic>Discrete mathematics</topic><topic>Graph theory</topic><topic>Information systems</topic><topic>Mathematics of computing</topic><topic>Models of computation</topic><topic>Parallel computing models</topic><topic>Probability and statistics</topic><topic>Theory and algorithms for application domains</topic><topic>Theory of computation</topic><toplevel>online_resources</toplevel><creatorcontrib>Manku, Gurmeet Singh</creatorcontrib><creatorcontrib>Rajagopalan, Sridhar</creatorcontrib><creatorcontrib>Lindsay, Bruce G.</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>SIGMOD 
record</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Manku, Gurmeet Singh</au><au>Rajagopalan, Sridhar</au><au>Lindsay, Bruce G.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Random sampling techniques for space efficient online computation of order statistics of large datasets</atitle><jtitle>SIGMOD record</jtitle><stitle>ACM SIGMOD</stitle><date>1999-06-01</date><risdate>1999</risdate><volume>28</volume><issue>2</issue><spage>251</spage><epage>262</epage><pages>251-262</pages><issn>0163-5808</issn><eissn>1943-5835</eissn><abstract>In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. 
Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/304181.304204</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0163-5808
ispartof	SIGMOD record, 1999-06, Vol.28 (2), p.251-262
issn	0163-5808 1943-5835
language	eng
recordid	cdi_proquest_miscellaneous_29463374
source	ACM Digital Library Complete
subjects	Concurrency; Data management systems; Database management system engines; Database query processing; Database query processing and optimization (theory); Database theory; Discrete mathematics; Graph theory; Information systems; Mathematics of computing; Models of computation; Parallel computing models; Probability and statistics; Theory and algorithms for application domains; Theory of computation
title	Random sampling techniques for space efficient online computation of order statistics of large datasets
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T10%3A00%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Random%20sampling%20techniques%20for%20space%20efficient%20online%20computation%20of%20order%20statistics%20of%20large%20datasets&rft.jtitle=SIGMOD%20record&rft.au=Manku,%20Gurmeet%20Singh&rft.date=1999-06-01&rft.volume=28&rft.issue=2&rft.spage=251&rft.epage=262&rft.pages=251-262&rft.issn=0163-5808&rft.eissn=1943-5835&rft_id=info:doi/10.1145/304181.304204&rft_dat=%3Cproquest_cross%3E29463374%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=29463374&rft_id=info:pmid/&rfr_iscdi=true