Distributed Joins and Data Placement for Minimal Network Traffic

Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely...

Detailed description

Bibliographic details
Published in: ACM transactions on database systems 2018-11, Vol.43 (3), p.1-45
Main authors: Polychroniou, Orestis; Zhang, Wangda; Ross, Kenneth A.
Format: Article
Language: English
Online access: Full text
container_end_page 45
container_issue 3
container_start_page 1
container_title ACM transactions on database systems
container_volume 43
creator Polychroniou, Orestis
Zhang, Wangda
Ross, Kenneth A.
description Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.
doi_str_mv 10.1145/3241039
format Article
fullrecord <record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3241039</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1145_3241039</sourcerecordid><originalsourceid>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</originalsourceid><addsrcrecordid>eNotj7lOAzEUAC0EEiEgfsEd1YKfr113oIRwKBxFqFfPl2TY7CLbCPH3EJFqutEMIefALgGkuhJcAhPmgMxAqbaRWspDMmNC80YZUMfkpJR3xpjsTDsj18tUak72qwZPH6c0Foqjp0usSF8HdGEbxkrjlOlTGtMWB_oc6veUP-gmY4zJnZKjiEMJZ3vOydvqdrO4b9Yvdw-Lm3XjuOpq01mJDIJx1qtOWsUjoAgxcml867TU3iILYpdurdYGOIANnKnoQXDsxJxc_HtdnkrJIfaf-a8n__TA-t14vx8Xv8V9SYc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><source>Access via ACM Digital Library</source><creator>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A.</creator><creatorcontrib>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A.</creatorcontrib><description>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. 
We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</description><identifier>ISSN: 0362-5915</identifier><identifier>EISSN: 1557-4644</identifier><identifier>DOI: 10.1145/3241039</identifier><language>eng</language><ispartof>ACM transactions on database systems, 2018-11, Vol.43 (3), p.1-45</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</citedby><cites>FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Polychroniou, Orestis</creatorcontrib><creatorcontrib>Zhang, Wangda</creatorcontrib><creatorcontrib>Ross, Kenneth A.</creatorcontrib><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><title>ACM transactions on database systems</title><description>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. 
A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</description><issn>0362-5915</issn><issn>1557-4644</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNotj7lOAzEUAC0EEiEgfsEd1YKfr113oIRwKBxFqFfPl2TY7CLbCPH3EJFqutEMIefALgGkuhJcAhPmgMxAqbaRWspDMmNC80YZUMfkpJR3xpjsTDsj18tUak72qwZPH6c0Foqjp0usSF8HdGEbxkrjlOlTGtMWB_oc6veUP-gmY4zJnZKjiEMJZ3vOydvqdrO4b9Yvdw-Lm3XjuOpq01mJDIJx1qtOWsUjoAgxcml867TU3iILYpdurdYGOIANnKnoQXDsxJxc_HtdnkrJIfaf-a8n__TA-t14vx8Xv8V9SYc</recordid><startdate>20181101</startdate><enddate>20181101</enddate><creator>Polychroniou, Orestis</creator><creator>Zhang, Wangda</creator><creator>Ross, Kenneth A.</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20181101</creationdate><title>Distributed Joins and Data Placement for Minimal Network Traffic</title><author>Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth 
A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c258t-8b4a01e9cbd584b52f1a3eff249d7c646dba0e34103bb6691211be205fd132a83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Polychroniou, Orestis</creatorcontrib><creatorcontrib>Zhang, Wangda</creatorcontrib><creatorcontrib>Ross, Kenneth A.</creatorcontrib><collection>CrossRef</collection><jtitle>ACM transactions on database systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Polychroniou, Orestis</au><au>Zhang, Wangda</au><au>Ross, Kenneth A.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Distributed Joins and Data Placement for Minimal Network Traffic</atitle><jtitle>ACM transactions on database systems</jtitle><date>2018-11-01</date><risdate>2018</risdate><volume>43</volume><issue>3</issue><spage>1</spage><epage>45</epage><pages>1-45</pages><issn>0362-5915</issn><eissn>1557-4644</eissn><abstract>Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. 
Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.</abstract><doi>10.1145/3241039</doi><tpages>45</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0362-5915
ispartof ACM transactions on database systems, 2018-11, Vol.43 (3), p.1-45
issn 0362-5915
1557-4644
language eng
recordid cdi_crossref_primary_10_1145_3241039
source Access via ACM Digital Library
title Distributed Joins and Data Placement for Minimal Network Traffic
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A18%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Distributed%20Joins%20and%20Data%20Placement%20for%20Minimal%20Network%20Traffic&rft.jtitle=ACM%20transactions%20on%20database%20systems&rft.au=Polychroniou,%20Orestis&rft.date=2018-11-01&rft.volume=43&rft.issue=3&rft.spage=1&rft.epage=45&rft.pages=1-45&rft.issn=0362-5915&rft.eissn=1557-4644&rft_id=info:doi/10.1145/3241039&rft_dat=%3Ccrossref%3E10_1145_3241039%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true