Efficient spatial data partitioning for distributed kNN joins

Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. …
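The abstract describes a locality-preserving spatial partitioner feeding a Spark-based kNN spatial join. The sketch below is not the authors' implementation; it is a minimal PySpark illustration of the general idea, in which every name, the 64x64 grid, the unit bounding box, and the partition count are assumptions made for the demo: points are keyed by the grid cell that contains them, partitioned by that key so nearby points stay together, and a brute-force kNN runs inside each partition before the per-partition candidates are merged.

# Illustrative sketch only, not the paper's partitioner: key points by grid
# cell so spatially close points land in the same Spark partition, then run
# a local kNN per partition and merge the candidates.
import heapq
import math

from pyspark import SparkContext

GRID = 64  # cells per axis; a tuning knob, not a value from the paper


def cell_id(x, y, minx=0.0, miny=0.0, maxx=1.0, maxy=1.0, grid=GRID):
    """Return the id of the grid cell containing (x, y)."""
    cx = min(int((x - minx) / (maxx - minx) * grid), grid - 1)
    cy = min(int((y - miny) / (maxy - miny) * grid), grid - 1)
    return cy * grid + cx


def local_knn(points, query, k):
    """Brute-force kNN among one partition's points (illustration only)."""
    return heapq.nsmallest(
        k, points, key=lambda p: math.hypot(p[0] - query[0], p[1] - query[1])
    )


if __name__ == "__main__":
    sc = SparkContext("local[*]", "grid-partitioned-knn-sketch")
    pts = [(0.12, 0.30), (0.15, 0.28), (0.80, 0.75), (0.82, 0.70), (0.50, 0.52)]
    query, k = (0.14, 0.29), 2

    keyed = sc.parallelize(pts).map(lambda p: (cell_id(p[0], p[1]), p))
    # partitionBy keeps equal cell ids on the same partition: the
    # locality-preserving step a spatial partitioner is responsible for.
    parted = keyed.partitionBy(8, lambda cid: cid % 8)

    # Each partition returns its k best candidates; the driver then takes
    # the k nearest among those, which yields the exact global answer.
    candidates = parted.values().mapPartitions(
        lambda it: local_knn(list(it), query, k)
    ).collect()
    print(local_knn(candidates, query, k))
    sc.stop()

A full kNN spatial join would repeat this for every query point and still has to handle the issues the paper targets, such as dense cells overloading a single partition (query skew) and neighbors that fall just outside a query point's cell.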

Detailed description

Bibliographic details
Published in: Journal of big data 2022-12, Vol.9 (1)
Main authors: Zeidan, Ayman; Vo, Huy T.
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page
container_issue 1
container_start_page
container_title Journal of big data
container_volume 9
creator Zeidan, Ayman ; Vo, Huy T.
description Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor (kNN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This contributes to the problem of spatial data partitioning through (1) providing a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in, load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate kNN spatial join query. Experimental tests show up to 1.48 times improvement in runtime as well as the accuracy of results.
doi_str_mv 10.1186/s40537-022-00587-2
format Article
publisher Springer International Publishing (Cham)
rights The Author(s) 2022; published under the Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/)
orcidid https://orcid.org/0000-0002-2881-5047
fulltext fulltext
identifier EISSN: 2196-1115
ispartof Journal of big data, 2022-12, Vol.9 (1)
issn 2196-1115
language eng
recordid cdi_proquest_journals_2672493028
source DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; Alma/SFX Local Collection; SpringerLink Journals - AutoHoldings; Springer Nature OA Free Journals
subjects Big Data
Communications Engineering
Computational Science and Engineering
Computer networks
Computer Science
Data Mining and Knowledge Discovery
Data sampling
Database Management
Datasets
Distributed memory
Distributed processing
Information Storage and Retrieval
Mathematical Applications in Computer Science
Networks
Parallel processing
Partitioning
Query processing
Run time (computers)
Spatial data
title Efficient spatial data partitioning for distributed kNN joins
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T23%3A50%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_sprin&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Efficient%20spatial%20data%20partitioning%20for%20distributed%20kNN%20joins&rft.jtitle=Journal%20of%20big%20data&rft.au=Zeidan,%20Ayman&rft.date=2022-12-01&rft.volume=9&rft.issue=1&rft.eissn=2196-1115&rft_id=info:doi/10.1186/s40537-022-00587-2&rft_dat=%3Cproquest_sprin%3E2672493028%3C/proquest_sprin%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2672493028&rft_id=info:pmid/&rfr_iscdi=true