A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data

Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in...


Bibliographic Details
Published in: International journal of digital earth, 2020-03, Vol.13 (3), p.410-428
Main authors: Hu, Fei, Yang, Chaowei, Jiang, Yongyao, Li, Yun, Song, Weiwei, Duffy, Daniel Q., Schnase, John L., Lee, Tsengdar
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 428
container_issue 3
container_start_page 410
container_title International journal of digital earth
container_volume 13
creator Hu, Fei
Yang, Chaowei
Jiang, Yongyao
Li, Yun
Song, Weiwei
Duffy, Daniel Q.
Schnase, John L.
Lee, Tsengdar
description Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.
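The description above names a global k-d tree index and a local hash-table index without showing either. The following is a minimal sketch of how the two could fit together, not the authors' implementation: the chunk key layout `(variable, time, lat, lon)`, the location tuple `(path, offset, length)`, the file paths, and the function names `build_local_index` and `kd_partition` are all assumptions made for illustration. The hash table maps a logical chunk to where its bytes live, and the k-d style median split groups chunks into near-equal partitions.

```python
from typing import Dict, List, Tuple

# Hypothetical logical chunk key: (variable, time index, lat index, lon index)
ChunkKey = Tuple[str, int, int, int]
# Hypothetical physical location: (file path, byte offset, byte length)
ChunkLoc = Tuple[str, int, int]


def build_local_index(entries: List[Tuple[ChunkKey, ChunkLoc]]) -> Dict[ChunkKey, ChunkLoc]:
    """Local index sketch: hash table from logical chunk key to physical location."""
    return dict(entries)


def kd_partition(keys: List[ChunkKey], leaf_size: int, depth: int = 0) -> List[List[ChunkKey]]:
    """Global index sketch: k-d style median splits over (time, lat, lon).

    Returns groups of at most `leaf_size` chunks, so each group is a
    roughly equal share of the workload.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    if len(keys) <= leaf_size:
        return [keys]
    axis = 1 + depth % 3  # cycle through key positions 1..3: time, lat, lon
    ordered = sorted(keys, key=lambda k: k[axis])
    mid = len(ordered) // 2  # median split keeps both halves balanced
    return (kd_partition(ordered[:mid], leaf_size, depth + 1)
            + kd_partition(ordered[mid:], leaf_size, depth + 1))


# Usage: eight chunks of a made-up variable "T" spread over two made-up files.
entries = [
    (("T", t, i, j), ("/data/file_%d.nc" % t, 1024 * (2 * i + j), 1024))
    for t in range(2) for i in range(2) for j in range(2)
]
index = build_local_index(entries)
groups = kd_partition(list(index), leaf_size=2)
for group in groups:
    print([index[key] for key in group])
```

In this sketch the partitions are balanced by chunk count only; keeping data locality, as the description requires, would additionally need the host list of each HDFS block, which is omitted here.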
doi_str_mv 10.1080/17538947.2018.1523957
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1080_17538947_2018_1523957</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_55ec0b4052244cffb8f279db316b35b4</doaj_id><sourcerecordid>2353261208</sourcerecordid><originalsourceid>FETCH-LOGICAL-c404t-3e5e49157a40dced5465c946be8d70b183e96ed093eb1a820dcb09fde2a0ea883</originalsourceid><addsrcrecordid>eNp9kU1v1DAQhiMEEqXwE5Ascd7Fn4lzY1UorVSph8LZGtvjrJdsHByXkv56Erb0yGlGr2ae-Xir6j2jW0Y1_cgaJXQrmy2nTG-Z4qJVzYvqbNU3ulXq5XMum9fVm2k6UFpTKcVZ9WtH9hEzZLePDnoSB4-_49CRqWQo2M0kpEzSWOIxPq76bgS3R3I3Qv5BHmLZk6vPl3ekJIIhRBdxKP1Mft5jnomNHekwTSOUuLAzTAUz8VDgbfUqQD_hu6d4Xn2__PLt4mpzc_v1-mJ3s3GSyrIRqFC2TDUgqXfolayVa2VtUfuGWqYFtjV62gq0DDRfiixtg0cOFEFrcV5dn7g-wcGMOR4hzyZBNH-FlDsDuUTXo1EKHbWSKs6ldCFYHXjTeitYbYWycmF9OLHGnJb7pmIO6T4Py_qGCyV4zThdJ6pTlctpmjKG56mMmtUu888us9plnuxa-j6d-uKwfPwIDyn33hSY-5RDhsHFyYj_I_4AGvCdSA</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2353261208</pqid></control><display><type>article</type><title>A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data</title><source>EZB-FREE-00999 freely available EZB journals</source><creator>Hu, Fei ; Yang, Chaowei ; Jiang, Yongyao ; Li, Yun ; Song, Weiwei ; Duffy, Daniel Q. ; Schnase, John L. ; Lee, Tsengdar</creator><creatorcontrib>Hu, Fei ; Yang, Chaowei ; Jiang, Yongyao ; Li, Yun ; Song, Weiwei ; Duffy, Daniel Q. ; Schnase, John L. ; Lee, Tsengdar</creatorcontrib><description>Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. 
To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.</description><identifier>ISSN: 1753-8947</identifier><identifier>EISSN: 1753-8955</identifier><identifier>DOI: 10.1080/17538947.2018.1523957</identifier><language>eng</language><publisher>Abingdon: Taylor &amp; Francis</publisher><subject>Apache Spark ; Big data ; Computer networks ; Computer simulation ; Data ; Data storage ; Data structures ; distributed computing ; Distributed processing ; Earth ; GIS ; HDFS ; hierarchical indexing ; Indexing ; Information storage ; multi-dimensional ; Optimization ; Queries ; Query languages ; Raster ; Spatial data ; Strategy ; Submarine pipelines ; Workload</subject><ispartof>International journal of digital earth, 2020-03, Vol.13 (3), p.410-428</ispartof><rights>2018 Informa UK Limited, trading as Taylor &amp; Francis Group 2018</rights><rights>2018 Informa UK Limited, trading as Taylor &amp; Francis 
Group</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c404t-3e5e49157a40dced5465c946be8d70b183e96ed093eb1a820dcb09fde2a0ea883</citedby><cites>FETCH-LOGICAL-c404t-3e5e49157a40dced5465c946be8d70b183e96ed093eb1a820dcb09fde2a0ea883</cites><orcidid>0000-0001-7768-4066</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Hu, Fei</creatorcontrib><creatorcontrib>Yang, Chaowei</creatorcontrib><creatorcontrib>Jiang, Yongyao</creatorcontrib><creatorcontrib>Li, Yun</creatorcontrib><creatorcontrib>Song, Weiwei</creatorcontrib><creatorcontrib>Duffy, Daniel Q.</creatorcontrib><creatorcontrib>Schnase, John L.</creatorcontrib><creatorcontrib>Lee, Tsengdar</creatorcontrib><title>A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data</title><title>International journal of digital earth</title><description>Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. 
To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.</description><subject>Apache Spark</subject><subject>Big data</subject><subject>Computer networks</subject><subject>Computer simulation</subject><subject>Data</subject><subject>Data storage</subject><subject>Data structures</subject><subject>distributed computing</subject><subject>Distributed processing</subject><subject>Earth</subject><subject>GIS</subject><subject>HDFS</subject><subject>hierarchical indexing</subject><subject>Indexing</subject><subject>Information storage</subject><subject>multi-dimensional</subject><subject>Optimization</subject><subject>Queries</subject><subject>Query languages</subject><subject>Raster</subject><subject>Spatial data</subject><subject>Strategy</subject><subject>Submarine 
pipelines</subject><subject>Workload</subject><issn>1753-8947</issn><issn>1753-8955</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>DOA</sourceid><recordid>eNp9kU1v1DAQhiMEEqXwE5Ascd7Fn4lzY1UorVSph8LZGtvjrJdsHByXkv56Erb0yGlGr2ae-Xir6j2jW0Y1_cgaJXQrmy2nTG-Z4qJVzYvqbNU3ulXq5XMum9fVm2k6UFpTKcVZ9WtH9hEzZLePDnoSB4-_49CRqWQo2M0kpEzSWOIxPq76bgS3R3I3Qv5BHmLZk6vPl3ekJIIhRBdxKP1Mft5jnomNHekwTSOUuLAzTAUz8VDgbfUqQD_hu6d4Xn2__PLt4mpzc_v1-mJ3s3GSyrIRqFC2TDUgqXfolayVa2VtUfuGWqYFtjV62gq0DDRfiixtg0cOFEFrcV5dn7g-wcGMOR4hzyZBNH-FlDsDuUTXo1EKHbWSKs6ldCFYHXjTeitYbYWycmF9OLHGnJb7pmIO6T4Py_qGCyV4zThdJ6pTlctpmjKG56mMmtUu888us9plnuxa-j6d-uKwfPwIDyn33hSY-5RDhsHFyYj_I_4AGvCdSA</recordid><startdate>20200303</startdate><enddate>20200303</enddate><creator>Hu, Fei</creator><creator>Yang, Chaowei</creator><creator>Jiang, Yongyao</creator><creator>Li, Yun</creator><creator>Song, Weiwei</creator><creator>Duffy, Daniel Q.</creator><creator>Schnase, John L.</creator><creator>Lee, Tsengdar</creator><general>Taylor &amp; Francis</general><general>Taylor &amp; Francis Ltd</general><general>Taylor &amp; Francis Group</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7ST</scope><scope>7UA</scope><scope>8FD</scope><scope>C1K</scope><scope>F1W</scope><scope>FR3</scope><scope>H8D</scope><scope>H96</scope><scope>KR7</scope><scope>L.G</scope><scope>L7M</scope><scope>SOI</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0001-7768-4066</orcidid></search><sort><creationdate>20200303</creationdate><title>A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data</title><author>Hu, Fei ; Yang, Chaowei ; Jiang, Yongyao ; Li, Yun ; Song, Weiwei ; Duffy, Daniel Q. ; Schnase, John L. 
; Lee, Tsengdar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c404t-3e5e49157a40dced5465c946be8d70b183e96ed093eb1a820dcb09fde2a0ea883</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Apache Spark</topic><topic>Big data</topic><topic>Computer networks</topic><topic>Computer simulation</topic><topic>Data</topic><topic>Data storage</topic><topic>Data structures</topic><topic>distributed computing</topic><topic>Distributed processing</topic><topic>Earth</topic><topic>GIS</topic><topic>HDFS</topic><topic>hierarchical indexing</topic><topic>Indexing</topic><topic>Information storage</topic><topic>multi-dimensional</topic><topic>Optimization</topic><topic>Queries</topic><topic>Query languages</topic><topic>Raster</topic><topic>Spatial data</topic><topic>Strategy</topic><topic>Submarine pipelines</topic><topic>Workload</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Hu, Fei</creatorcontrib><creatorcontrib>Yang, Chaowei</creatorcontrib><creatorcontrib>Jiang, Yongyao</creatorcontrib><creatorcontrib>Li, Yun</creatorcontrib><creatorcontrib>Song, Weiwei</creatorcontrib><creatorcontrib>Duffy, Daniel Q.</creatorcontrib><creatorcontrib>Schnase, John L.</creatorcontrib><creatorcontrib>Lee, Tsengdar</creatorcontrib><collection>CrossRef</collection><collection>Environment Abstracts</collection><collection>Water Resources Abstracts</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ASFA: Aquatic Sciences and Fisheries Abstracts</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Aquatic Science &amp; Fisheries Abstracts (ASFA) 2: Ocean Technology, Policy &amp; Non-Living Resources</collection><collection>Civil Engineering 
Abstracts</collection><collection>Aquatic Science &amp; Fisheries Abstracts (ASFA) Professional</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Environment Abstracts</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>International journal of digital earth</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Hu, Fei</au><au>Yang, Chaowei</au><au>Jiang, Yongyao</au><au>Li, Yun</au><au>Song, Weiwei</au><au>Duffy, Daniel Q.</au><au>Schnase, John L.</au><au>Lee, Tsengdar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data</atitle><jtitle>International journal of digital earth</jtitle><date>2020-03-03</date><risdate>2020</risdate><volume>13</volume><issue>3</issue><spage>410</spage><epage>428</epage><pages>410-428</pages><issn>1753-8947</issn><eissn>1753-8955</eissn><abstract>Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. 
To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.</abstract><cop>Abingdon</cop><pub>Taylor &amp; Francis</pub><doi>10.1080/17538947.2018.1523957</doi><tpages>19</tpages><orcidid>https://orcid.org/0000-0001-7768-4066</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1753-8947
ispartof International journal of digital earth, 2020-03, Vol.13 (3), p.410-428
issn 1753-8947
1753-8955
language eng
recordid cdi_crossref_primary_10_1080_17538947_2018_1523957
source EZB-FREE-00999 freely available EZB journals
subjects Apache Spark
Big data
Computer networks
Computer simulation
Data
Data storage
Data structures
distributed computing
Distributed processing
Earth
GIS
HDFS
hierarchical indexing
Indexing
Information storage
multi-dimensional
Optimization
Queries
Query languages
Raster
Spatial data
Strategy
Submarine pipelines
Workload
title A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T14%3A09%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20hierarchical%20indexing%20strategy%20for%20optimizing%20Apache%20Spark%20with%20HDFS%20to%20efficiently%20query%20big%20geospatial%20raster%20data&rft.jtitle=International%20journal%20of%20digital%20earth&rft.au=Hu,%20Fei&rft.date=2020-03-03&rft.volume=13&rft.issue=3&rft.spage=410&rft.epage=428&rft.pages=410-428&rft.issn=1753-8947&rft.eissn=1753-8955&rft_id=info:doi/10.1080/17538947.2018.1523957&rft_dat=%3Cproquest_cross%3E2353261208%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2353261208&rft_id=info:pmid/&rft_doaj_id=oai_doaj_org_article_55ec0b4052244cffb8f279db316b35b4&rfr_iscdi=true