Distributed Joins and Data Placement for Minimal Network Traffic
Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely...
Published in: | ACM Transactions on Database Systems, 2018-11, Vol. 43 (3), p. 1-45 |
---|---|
Main authors: | Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A. |
Format: | Article |
Language: | eng |
Online access: | Full text |
container_end_page | 45 |
---|---|
container_issue | 3 |
container_start_page | 1 |
container_title | ACM transactions on database systems |
container_volume | 43 |
creator | Polychroniou, Orestis ; Zhang, Wangda ; Ross, Kenneth A. |
description | Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads. |
doi_str_mv | 10.1145/3241039 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0362-5915 |
ispartof | ACM transactions on database systems, 2018-11, Vol.43 (3), p.1-45 |
issn | 0362-5915 (ISSN) ; 1557-4644 (EISSN) |
language | eng |
recordid | cdi_crossref_primary_10_1145_3241039 |
source | Access via ACM Digital Library |
title | Distributed Joins and Data Placement for Minimal Network Traffic |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A18%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Distributed%20Joins%20and%20Data%20Placement%20for%20Minimal%20Network%20Traffic&rft.jtitle=ACM%20transactions%20on%20database%20systems&rft.au=Polychroniou,%20Orestis&rft.date=2018-11-01&rft.volume=43&rft.issue=3&rft.spage=1&rft.epage=45&rft.pages=1-45&rft.issn=0362-5915&rft.eissn=1557-4644&rft_id=info:doi/10.1145/3241039&rft_dat=%3Ccrossref%3E10_1145_3241039%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |