Supercharging Distributed Computing Environments For High Performance Data Engineering


Bibliographic details
Main authors: Perera, Niranda; Shan, Kaiying; Kamburugamuwe, Supun; Kanewela, Thejaka Amila; Widanage, Chathura; Sarker, Arup; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; Fox, Geoffrey
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Perera, Niranda
Shan, Kaiying
Kamburugamuwe, Supun
Kanewela, Thejaka Amila
Widanage, Chathura
Sarker, Arup
Staylor, Mills
Zhong, Tianle
Abeykoon, Vibhatha
Fox, Geoffrey
description The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon & CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.
doi_str_mv 10.48550/arxiv.2301.07896
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2301_07896</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2301_07896</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-3faffabc06bd4d5f68bea10897dbf161b87799bd2892fed854ff34ad07c9b54f3</originalsourceid><addsrcrecordid>eNotj8tOwzAURL1hgQof0BX-gQTn5ccSpS1FqgQSFdvoOrZTS8SJbpwK_p60ZTWa0cxIh5B1xtJSVhV7Bvzx5zQvWJYyIRW_J1-f82ixPQF2PnR046eIXs_RGloP_TjHS7oNZ49D6G2IE90NSPe-O9EPi27AHkJr6QYiLLXlw1pcJg_kzsH3ZB__dUWOu-2x3ieH99e3-uWQABc8KRw4B7plXJvSVI5LbSFjUgmjXcYzLYVQSptcqtxZI6vSuaIEw0Sr9GKKFXm63V7BmhF9D_jbXACbK2DxB0obTbU</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><source>arXiv.org</source><creator>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</creator><creatorcontrib>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</creatorcontrib><description>The data engineering and data science community has embraced the idea of using Python &amp; R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time &amp; effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. 
Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon &amp; CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.</description><identifier>DOI: 10.48550/arxiv.2301.07896</identifier><language>eng</language><subject>Computer Science - Databases ; Computer Science - Distributed, Parallel, and Cluster Computing</subject><creationdate>2023-01</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2301.07896$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2301.07896$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Shan, Kaiying</creatorcontrib><creatorcontrib>Kamburugamuwe, 
Supun</creatorcontrib><creatorcontrib>Kanewela, Thejaka Amila</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Sarker, Arup</creatorcontrib><creatorcontrib>Staylor, Mills</creatorcontrib><creatorcontrib>Zhong, Tianle</creatorcontrib><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><description>The data engineering and data science community has embraced the idea of using Python &amp; R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time &amp; effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. 
We believe the success of Cylon &amp; CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.</description><subject>Computer Science - Databases</subject><subject>Computer Science - Distributed, Parallel, and Cluster Computing</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8tOwzAURL1hgQof0BX-gQTn5ccSpS1FqgQSFdvoOrZTS8SJbpwK_p60ZTWa0cxIh5B1xtJSVhV7Bvzx5zQvWJYyIRW_J1-f82ixPQF2PnR046eIXs_RGloP_TjHS7oNZ49D6G2IE90NSPe-O9EPi27AHkJr6QYiLLXlw1pcJg_kzsH3ZB__dUWOu-2x3ieH99e3-uWQABc8KRw4B7plXJvSVI5LbSFjUgmjXcYzLYVQSptcqtxZI6vSuaIEw0Sr9GKKFXm63V7BmhF9D_jbXACbK2DxB0obTbU</recordid><startdate>20230119</startdate><enddate>20230119</enddate><creator>Perera, Niranda</creator><creator>Shan, Kaiying</creator><creator>Kamburugamuwe, Supun</creator><creator>Kanewela, Thejaka Amila</creator><creator>Widanage, Chathura</creator><creator>Sarker, Arup</creator><creator>Staylor, Mills</creator><creator>Zhong, Tianle</creator><creator>Abeykoon, Vibhatha</creator><creator>Fox, Geoffrey</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230119</creationdate><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><author>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-3faffabc06bd4d5f68bea10897dbf161b87799bd2892fed854ff34ad07c9b54f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Databases</topic><topic>Computer Science - Distributed, Parallel, and Cluster 
Computing</topic><toplevel>online_resources</toplevel><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Shan, Kaiying</creatorcontrib><creatorcontrib>Kamburugamuwe, Supun</creatorcontrib><creatorcontrib>Kanewela, Thejaka Amila</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Sarker, Arup</creatorcontrib><creatorcontrib>Staylor, Mills</creatorcontrib><creatorcontrib>Zhong, Tianle</creatorcontrib><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Perera, Niranda</au><au>Shan, Kaiying</au><au>Kamburugamuwe, Supun</au><au>Kanewela, Thejaka Amila</au><au>Widanage, Chathura</au><au>Sarker, Arup</au><au>Staylor, Mills</au><au>Zhong, Tianle</au><au>Abeykoon, Vibhatha</au><au>Fox, Geoffrey</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Supercharging Distributed Computing Environments For High Performance Data Engineering</atitle><date>2023-01-19</date><risdate>2023</risdate><abstract>The data engineering and data science community has embraced the idea of using Python &amp; R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time &amp; effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. 
Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon &amp; CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.</abstract><doi>10.48550/arxiv.2301.07896</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2301.07896
ispartof
issn
language eng
recordid cdi_arxiv_primary_2301_07896
source arXiv.org
subjects Computer Science - Databases
Computer Science - Distributed, Parallel, and Cluster Computing
title Supercharging Distributed Computing Environments For High Performance Data Engineering
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T18%3A36%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Supercharging%20Distributed%20Computing%20Environments%20For%20High%20Performance%20Data%20Engineering&rft.au=Perera,%20Niranda&rft.date=2023-01-19&rft_id=info:doi/10.48550/arxiv.2301.07896&rft_dat=%3Carxiv_GOX%3E2301_07896%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true