Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization

Detailed description

Bibliographic details
Published in:	Big data 2020-06, Vol.8 (3), p.235-247
Main authors:	Munir, Rana Faisal; Abelló, Alberto; Romero, Oscar; Thiele, Maik; Lehner, Wolfgang
Format:	Article
Language:	eng
Online access:	Full text

container_end_page	247
container_issue	3
container_start_page	235
container_title	Big data
container_volume	8
creator	Munir, Rana Faisal ; Abelló, Alberto ; Romero, Oscar ; Thiele, Maik ; Lehner, Wolfgang
description	Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because they allow projection and selection operations. Thus, they make it possible to read less data from disk when required. However, this is not very well exploited by distributed processing frameworks (e.g., Hadoop, Spark) when analytical queries are posed. These frameworks divide the data into multiple partitions and then process each partition in a separate task, consequently creating tasks based on the total file size and not the actual size of the data to be read. This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1× speedup compared with default solutions.
doi_str_mv	10.1089/big.2019.0068
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2402441600</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2402441600</sourcerecordid><originalsourceid>FETCH-LOGICAL-c249t-ddb50a0217aee935938036b722287af0e984177a447408d44be460d14a1f07093</originalsourceid><addsrcrecordid>eNo9kD1PwzAYhC0EolXpyIoysqS8_qgdj6gCitSqDFRis5zEqVw5cbETpPLrSdTSW-6GRzc8CN1jmGHI5FNudzMCWM4AeHaFxgRzkXImvq4vm-MRmsa4hz5CSJbhWzSihEoh6HyM1gvfVHbXBdvskg8dtHPG2VgnlQ_J8pgHWyYrffRdG5NtHKB151qbbvK9KVr7Y5LNobW1_dWt9c0duqm0i2Z67gnavr58LpbpavP2vnhepQVhsk3LMp-DBoKFNkbSuaQZUJ4LQkgmdAVGZgwLoRkTDLKSsdwwDiVmGlcgQNIJejz9HoL_7kxsVW1jYZzTjfFdVIQBYQxzgB5NT2gRfIzBVOoQbK3DUWFQg0TVS1SDRDVI7PmH83WX16a80P_K6B9bC2wb</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2402441600</pqid></control><display><type>article</type><title>Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization</title><source>Alma/SFX Local Collection</source><creator>Munir, Rana Faisal ; Abelló, Alberto ; Romero, Oscar ; Thiele, Maik ; Lehner, Wolfgang</creator><creatorcontrib>Munir, Rana Faisal ; Abelló, Alberto ; Romero, Oscar ; Thiele, Maik ; Lehner, Wolfgang</creatorcontrib><description>Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because they allow projection and selection operations. Thus, they allow (when required) to read less data from the disk. However, this is not very well exploited by distributed processing frameworks (e.g., Hadoop, Spark) when analytical queries are posed. These frameworks divide the data into multiple partitions and then process each partition in a separate task, consequently creating tasks based on the total file size and not the actual size of the data to be read. 
This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide 2.1 × speedup compared with default solutions.</description><identifier>ISSN: 2167-6461</identifier><identifier>EISSN: 2167-647X</identifier><identifier>DOI: 10.1089/big.2019.0068</identifier><identifier>PMID: 32397735</identifier><language>eng</language><publisher>United States</publisher><ispartof>Big data, 2020-06, Vol.8 (3), p.235-247</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c249t-ddb50a0217aee935938036b722287af0e984177a447408d44be460d14a1f07093</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32397735$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Munir, Rana Faisal</creatorcontrib><creatorcontrib>Abelló, Alberto</creatorcontrib><creatorcontrib>Romero, Oscar</creatorcontrib><creatorcontrib>Thiele, Maik</creatorcontrib><creatorcontrib>Lehner, Wolfgang</creatorcontrib><title>Configuring 
Parallelism for Hybrid Layouts Using Multi-Objective Optimization</title><title>Big data</title><addtitle>Big Data</addtitle><description>Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because they allow projection and selection operations. Thus, they allow (when required) to read less data from the disk. However, this is not very well exploited by distributed processing frameworks (e.g., Hadoop, Spark) when analytical queries are posed. These frameworks divide the data into multiple partitions and then process each partition in a separate task, consequently creating tasks based on the total file size and not the actual size of the data to be read. This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. 
Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide 2.1 × speedup compared with default solutions.</description><issn>2167-6461</issn><issn>2167-647X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNo9kD1PwzAYhC0EolXpyIoysqS8_qgdj6gCitSqDFRis5zEqVw5cbETpPLrSdTSW-6GRzc8CN1jmGHI5FNudzMCWM4AeHaFxgRzkXImvq4vm-MRmsa4hz5CSJbhWzSihEoh6HyM1gvfVHbXBdvskg8dtHPG2VgnlQ_J8pgHWyYrffRdG5NtHKB151qbbvK9KVr7Y5LNobW1_dWt9c0duqm0i2Z67gnavr58LpbpavP2vnhepQVhsk3LMp-DBoKFNkbSuaQZUJ4LQkgmdAVGZgwLoRkTDLKSsdwwDiVmGlcgQNIJejz9HoL_7kxsVW1jYZzTjfFdVIQBYQxzgB5NT2gRfIzBVOoQbK3DUWFQg0TVS1SDRDVI7PmH83WX16a80P_K6B9bC2wb</recordid><startdate>20200601</startdate><enddate>20200601</enddate><creator>Munir, Rana Faisal</creator><creator>Abelló, Alberto</creator><creator>Romero, Oscar</creator><creator>Thiele, Maik</creator><creator>Lehner, Wolfgang</creator><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope></search><sort><creationdate>20200601</creationdate><title>Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization</title><author>Munir, Rana Faisal ; Abelló, Alberto ; Romero, Oscar ; Thiele, Maik ; Lehner, Wolfgang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c249t-ddb50a0217aee935938036b722287af0e984177a447408d44be460d14a1f07093</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Munir, Rana Faisal</creatorcontrib><creatorcontrib>Abelló, Alberto</creatorcontrib><creatorcontrib>Romero, Oscar</creatorcontrib><creatorcontrib>Thiele, Maik</creatorcontrib><creatorcontrib>Lehner, Wolfgang</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - 
Academic</collection><jtitle>Big data</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Munir, Rana Faisal</au><au>Abelló, Alberto</au><au>Romero, Oscar</au><au>Thiele, Maik</au><au>Lehner, Wolfgang</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization</atitle><jtitle>Big data</jtitle><addtitle>Big Data</addtitle><date>2020-06-01</date><risdate>2020</risdate><volume>8</volume><issue>3</issue><spage>235</spage><epage>247</epage><pages>235-247</pages><issn>2167-6461</issn><eissn>2167-647X</eissn><abstract>Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because they allow projection and selection operations. Thus, they allow (when required) to read less data from the disk. However, this is not very well exploited by distributed processing frameworks (e.g., Hadoop, Spark) when analytical queries are posed. These frameworks divide the data into multiple partitions and then process each partition in a separate task, consequently creating tasks based on the total file size and not the actual size of the data to be read. This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. 
We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide 2.1 × speedup compared with default solutions.</abstract><cop>United States</cop><pmid>32397735</pmid><doi>10.1089/big.2019.0068</doi><tpages>13</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 2167-6461
ispartof	Big data, 2020-06, Vol.8 (3), p.235-247
issn	2167-6461 2167-647X
language	eng
recordid	cdi_proquest_miscellaneous_2402441600
source	Alma/SFX Local Collection
title	Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T18%3A15%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Configuring%20Parallelism%20for%20Hybrid%20Layouts%20Using%20Multi-Objective%20Optimization&rft.jtitle=Big%20data&rft.au=Munir,%20Rana%20Faisal&rft.date=2020-06-01&rft.volume=8&rft.issue=3&rft.spage=235&rft.epage=247&rft.pages=235-247&rft.issn=2167-6461&rft.eissn=2167-647X&rft_id=info:doi/10.1089/big.2019.0068&rft_dat=%3Cproquest_cross%3E2402441600%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2402441600&rft_id=info:pmid/32397735&rfr_iscdi=true