Load Balancing in MapReduce Based on Scalable Cardinality Estimates

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly...

Detailed description

Bibliographic details
Main authors:	Gufler, B.; Augsten, N.; Reiser, A.; Kemper, A.
Format:	Conference proceeding
Language:	eng
Subjects:	Approximation methods; Clustering algorithms; Estimation; Histograms; Load management; Monitoring; Partitioning algorithms
Online access:	Order full text

container_end_page	533
container_issue
container_start_page	522
container_title
container_volume
creator	Gufler, B. ; Augsten, N. ; Reiser, A. ; Kemper, A.
description	MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution, whereas scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is further amplified by the high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires gathering statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets to approximate the global data distribution.
doi_str_mv	10.1109/ICDE.2012.58
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6228111</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6228111</ieee_id><sourcerecordid>6228111</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-a38226705c2d1e061f42a76a7af673ed3c95aea5d92ca717af22149f450891953</originalsourceid><addsrcrecordid>eNotjktLxEAQhMcXGNe9efMyfyBrd88rc9S46kJE8AHeljYzkZGYXTbxsP_eAa1LwUdRVUJcICwQwV-t6tvlggBpYaoDcQbOeqOddtWhKEg5UwLZ9yMx965CbZ0C0ITHokCwqrSqolMxH8cvyPIa0UAh6mbDQd5wz0Obhk-ZBvnI2-cYftqY8RiD3Azypc2Bjz7KmnchDdynaS-X45S-eYrjuTjpuB_j_N9n4u1u-Vo_lM3T_aq-bsqEzkwl5wNkHZiWAkaw2GliZ9lxl7_GoFpvOLIJnlp2mDERat9pA5VHb9RMXP71phjjervL67v92hJViKh-ARjXTKU</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Load Balancing in MapReduce Based on Scalable Cardinality Estimates</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Gufler, B. ; Augsten, N. ; Reiser, A. ; Kemper, A.</creator><creatorcontrib>Gufler, B. ; Augsten, N. ; Reiser, A. ; Kemper, A.</creatorcontrib><description>MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. 
This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.</description><identifier>ISSN: 1063-6382</identifier><identifier>ISBN: 9781467300421</identifier><identifier>ISBN: 146730042X</identifier><identifier>EISSN: 2375-026X</identifier><identifier>EISBN: 0769547478</identifier><identifier>EISBN: 9780769547473</identifier><identifier>DOI: 10.1109/ICDE.2012.58</identifier><language>eng</language><publisher>IEEE</publisher><subject>Approximation methods ; Clustering algorithms ; Estimation ; Histograms ; Load management ; Monitoring ; Partitioning algorithms</subject><ispartof>2012 IEEE 28th International Conference on Data Engineering, 2012, p.522-533</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6228111$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,777,781,786,787,2052,27906,54901</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6228111$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Gufler, B.</creatorcontrib><creatorcontrib>Augsten, N.</creatorcontrib><creatorcontrib>Reiser, A.</creatorcontrib><creatorcontrib>Kemper, A.</creatorcontrib><title>Load 
Balancing in MapReduce Based on Scalable Cardinality Estimates</title><title>2012 IEEE 28th International Conference on Data Engineering</title><addtitle>icde</addtitle><description>MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. 
Second, an integration component aggregates these subsets approximating the global data distribution.</description><subject>Approximation methods</subject><subject>Clustering algorithms</subject><subject>Estimation</subject><subject>Histograms</subject><subject>Load management</subject><subject>Monitoring</subject><subject>Partitioning algorithms</subject><issn>1063-6382</issn><issn>2375-026X</issn><isbn>9781467300421</isbn><isbn>146730042X</isbn><isbn>0769547478</isbn><isbn>9780769547473</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotjktLxEAQhMcXGNe9efMyfyBrd88rc9S46kJE8AHeljYzkZGYXTbxsP_eAa1LwUdRVUJcICwQwV-t6tvlggBpYaoDcQbOeqOddtWhKEg5UwLZ9yMx965CbZ0C0ITHokCwqrSqolMxH8cvyPIa0UAh6mbDQd5wz0Obhk-ZBvnI2-cYftqY8RiD3Azypc2Bjz7KmnchDdynaS-X45S-eYrjuTjpuB_j_N9n4u1u-Vo_lM3T_aq-bsqEzkwl5wNkHZiWAkaw2GliZ9lxl7_GoFpvOLIJnlp2mDERat9pA5VHb9RMXP71phjjervL67v92hJViKh-ARjXTKU</recordid><startdate>201204</startdate><enddate>201204</enddate><creator>Gufler, B.</creator><creator>Augsten, N.</creator><creator>Reiser, A.</creator><creator>Kemper, A.</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201204</creationdate><title>Load Balancing in MapReduce Based on Scalable Cardinality Estimates</title><author>Gufler, B. ; Augsten, N. ; Reiser, A. 
; Kemper, A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-a38226705c2d1e061f42a76a7af673ed3c95aea5d92ca717af22149f450891953</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Approximation methods</topic><topic>Clustering algorithms</topic><topic>Estimation</topic><topic>Histograms</topic><topic>Load management</topic><topic>Monitoring</topic><topic>Partitioning algorithms</topic><toplevel>online_resources</toplevel><creatorcontrib>Gufler, B.</creatorcontrib><creatorcontrib>Augsten, N.</creatorcontrib><creatorcontrib>Reiser, A.</creatorcontrib><creatorcontrib>Kemper, A.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Gufler, B.</au><au>Augsten, N.</au><au>Reiser, A.</au><au>Kemper, A.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Load Balancing in MapReduce Based on Scalable Cardinality Estimates</atitle><btitle>2012 IEEE 28th International Conference on Data Engineering</btitle><stitle>icde</stitle><date>2012-04</date><risdate>2012</risdate><spage>522</spage><epage>533</epage><pages>522-533</pages><issn>1063-6382</issn><eissn>2375-026X</eissn><isbn>9781467300421</isbn><isbn>146730042X</isbn><eisbn>0769547478</eisbn><eisbn>9780769547473</eisbn><abstract>MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. 
Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.</abstract><pub>IEEE</pub><doi>10.1109/ICDE.2012.58</doi><tpages>12</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1063-6382
ispartof	2012 IEEE 28th International Conference on Data Engineering, 2012, p.522-533
issn	1063-6382 2375-026X
language	eng
recordid	cdi_ieee_primary_6228111
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Approximation methods ; Clustering algorithms ; Estimation ; Histograms ; Load management ; Monitoring ; Partitioning algorithms
title	Load Balancing in MapReduce Based on Scalable Cardinality Estimates
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T08%3A59%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Load%20Balancing%20in%20MapReduce%20Based%20on%20Scalable%20Cardinality%20Estimates&rft.btitle=2012%20IEEE%2028th%20International%20Conference%20on%20Data%20Engineering&rft.au=Gufler,%20B.&rft.date=2012-04&rft.spage=522&rft.epage=533&rft.pages=522-533&rft.issn=1063-6382&rft.eissn=2375-026X&rft.isbn=9781467300421&rft.isbn_list=146730042X&rft_id=info:doi/10.1109/ICDE.2012.58&rft_dat=%3Cieee_6IE%3E6228111%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=0769547478&rft.eisbn_list=9780769547473&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6228111&rfr_iscdi=true
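
The two-component scheme summarized in the description field (per-mapper monitoring, followed by a single integration step at the controller) can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the authors' algorithm: the record does not say how the "most relevant subset" is chosen, so the sketch assumes each mapper reports only its most frequent keys, and the function names, the `head_size` parameter, and the quadratic cost model are all hypothetical.

```python
from collections import Counter


def mapper_summary(keys, head_size):
    """Monitoring step (assumed design): keep only the head_size most
    frequent local keys so the statistics sent to the controller stay small."""
    counts = Counter(keys)
    head = dict(counts.most_common(head_size))
    # Tuples whose keys were not reported individually.
    tail_tuples = sum(counts.values()) - sum(head.values())
    return {"head": head, "tail_tuples": tail_tuples}


def integrate(summaries):
    """Integration step (assumed design): sum the reported counts in one
    pass, since the mappers have terminated and cannot be asked again."""
    global_head = Counter()
    tail_tuples = 0
    for summary in summaries:
        global_head.update(summary["head"])
        tail_tuples += summary["tail_tuples"]
    return global_head, tail_tuples


def estimate_cost(global_head, cost=lambda n: n * n):
    """Apply a cost model (here a hypothetical quadratic reducer) to the
    estimated cardinality of each reported key."""
    return {key: cost(n) for key, n in global_head.items()}


# Two mappers, each seeing only a fraction of a skewed data set.
mapper_inputs = [
    ["a", "a", "a", "b", "b", "c"],
    ["a", "a", "a", "b", "b", "d"],
]
summaries = [mapper_summary(keys, head_size=2) for keys in mapper_inputs]
global_head, tail_tuples = integrate(summaries)
costs = estimate_cost(global_head)
```

In this sketch the integrated counts are lower bounds: a key that falls into the unreported tail of some mapper is undercounted there, which is one reason the integration step can only approximate the global distribution.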