Distributed Joins and Data Placement for Minimal Network Traffic

Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely...

Bibliographic Details
Published in:	ACM transactions on database systems 2018-11, Vol.43 (3), p.1-45
Main Authors:	Polychroniou, Orestis; Zhang, Wangda; Ross, Kenneth A.
Format:	Article
Language:	English
Online Access:	Full text

container_end_page	45
container_issue	3
container_start_page	1
container_title	ACM transactions on database systems
container_volume	43
creator	Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A.
description	Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.
doi_str_mv	10.1145/3241039
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3241039</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1145_3241039</sourcerecordid><originalsourceid>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</originalsourceid><addsrcrecordid>eNotj7lOAzEUAC0EEiEgfsEd1YKfr113oIRwKBxFqFfPl2TY7CLbCPH3EJFqutEMIefALgGkuhJcAhPmgMxAqbaRWspDMmNC80YZUMfkpJR3xpjsTDsj18tUak72qwZPH6c0Foqjp0usSF8HdGEbxkrjlOlTGtMWB_oc6veUP-gmY4zJnZKjiEMJZ3vOydvqdrO4b9Yvdw-Lm3XjuOpq01mJDIJx1qtOWsUjoAgxcml867TU3iILYpdurdYGOIANnKnoQXDsxJxc_HtdnkrJIfaf-a8n__TA-t14vx8Xv8V9SYc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><source>Access via ACM Digital Library</source><creator>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A.</creator><creatorcontrib>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A.</creatorcontrib><description>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. 
We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</description><identifier>ISSN: 0362-5915</identifier><identifier>EISSN: 1557-4644</identifier><identifier>DOI: 10.1145/3241039</identifier><language>eng</language><ispartof>ACM transactions on database systems, 2018-11, Vol.43 (3), p.1-45</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</citedby><cites>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Polychroniou, Orestis</creatorcontrib><creatorcontrib>Zhang, Wangda</creatorcontrib><creatorcontrib>Ross, Kenneth A.</creatorcontrib><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><title>ACM transactions on database systems</title><description>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. 
A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</description><issn>0362-5915</issn><issn>1557-4644</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNotj7lOAzEUAC0EEiEgfsEd1YKfr113oIRwKBxFqFfPl2TY7CLbCPH3EJFqutEMIefALgGkuhJcAhPmgMxAqbaRWspDMmNC80YZUMfkpJR3xpjsTDsj18tUak72qwZPH6c0Foqjp0usSF8HdGEbxkrjlOlTGtMWB_oc6veUP-gmY4zJnZKjiEMJZ3vOydvqdrO4b9Yvdw-Lm3XjuOpq01mJDIJx1qtOWsUjoAgxcml867TU3iILYpdurdYGOIANnKnoQXDsxJxc_HtdnkrJIfaf-a8n__TA-t14vx8Xv8V9SYc</recordid><startdate>20181101</startdate><enddate>20181101</enddate><creator>Polychroniou, Orestis</creator><creator>Zhang, Wangda</creator><creator>Ross, Kenneth A.</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20181101</creationdate><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><author>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth 
A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Polychroniou, Orestis</creatorcontrib><creatorcontrib>Zhang, Wangda</creatorcontrib><creatorcontrib>Ross, Kenneth A.</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on database systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Polychroniou, Orestis</au><au>Zhang, Wangda</au><au>Ross, Kenneth A.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Distributed Joins and Data Placement for Minimal Network Traffic</atitle><jtitle>ACM transactions on database systems</jtitle><date>2018-11-01</date><risdate>2018</risdate><volume>43</volume><issue>3</issue><spage>1</spage><epage>45</epage><pages>1-45</pages><issn>0362-5915</issn><eissn>1557-4644</eissn><abstract>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. 
Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</abstract><doi>10.1145/3241039</doi><tpages>45</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0362-5915
ispartof	ACM transactions on database systems, 2018-11, Vol.43 (3), p.1-45
issn	0362-5915 1557-4644
language	eng
recordid	cdi_crossref_primary_10_1145_3241039
source	Access via ACM Digital Library
title	Distributed Joins and Data Placement for Minimal Network Traffic
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A18%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Distributed%20Joins%20and%20Data%20Placement%20for%20Minimal%20Network%20Traffic&rft.jtitle=ACM%20transactions%20on%20database%20systems&rft.au=Polychroniou,%20Orestis&rft.date=2018-11-01&rft.volume=43&rft.issue=3&rft.spage=1&rft.epage=45&rft.pages=1-45&rft.issn=0362-5915&rft.eissn=1557-4644&rft_id=info:doi/10.1145/3241039&rft_dat=%3Ccrossref%3E10_1145_3241039%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
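
The description field above says that track join reduces network traffic by choosing a transfer schedule for each distinct join key and by exploiting locality. The Python sketch below is a toy illustration of that idea only, not the algorithm from the article: it compares the bytes a plain hash join would send for a key with the bytes sent when all matching tuples are shipped to the single node that already holds the most data for that key. The function names, the `key % n` stand-in for a hash function, the sample placement, and the choice to ignore the cost of the tracking messages themselves are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: for each join key, one (r_bytes, s_bytes) pair per node,
# giving the size of the R and S tuples with that key stored on that node.
Placement = Dict[int, List[Tuple[int, int]]]


def hash_join_traffic(key: int, sizes: List[Tuple[int, int]]) -> int:
    """Bytes sent when every tuple of `key` goes to the node picked by hashing."""
    dest = key % len(sizes)  # stand-in for a real hash function (assumption)
    return sum(r + s for node, (r, s) in enumerate(sizes) if node != dest)


def best_single_destination_traffic(sizes: List[Tuple[int, int]]) -> int:
    """Bytes sent when all tuples of a key go to the node already holding the most."""
    total = sum(r + s for r, s in sizes)
    return total - max(r + s for r, s in sizes)


def compare(placement: Placement) -> Tuple[int, int]:
    """Total traffic over all keys: (hash join, locality-aware single destination)."""
    hashed = sum(hash_join_traffic(k, v) for k, v in placement.items())
    local = sum(best_single_destination_traffic(v) for v in placement.values())
    return hashed, local


if __name__ == "__main__":
    # Three nodes; key 7 is heavily skewed toward node 0, key 4 toward node 1.
    sample: Placement = {
        7: [(100, 0), (0, 20), (0, 0)],
        4: [(10, 10), (30, 30), (5, 5)],
    }
    hashed, local = compare(sample)
    print(f"hash join: {hashed} bytes, locality-aware: {local} bytes")
```

For key 7, hashing happens to pick node 1, so the large R fragment on node 0 has to move, while the locality-aware choice moves only the small S fragment. For key 4, both strategies pick node 1 and the cost is identical. A per-key schedule can therefore never be worse than hashing in this simplified cost model, which counts payload bytes only.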