Supercharging Distributed Computing Environments For High Performance Data Engineering
The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the cap...
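Main authors: | Perera, Niranda; Shan, Kaiying; Kamburugamuwe, Supun; Kanewela, Thejaka Amila; Widanage, Chathura; Sarker, Arup; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; Fox, Geoffrey |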
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Databases; Computer Science - Distributed, Parallel, and Cluster Computing |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Perera, Niranda; Shan, Kaiying; Kamburugamuwe, Supun; Kanewela, Thejaka Amila; Widanage, Chathura; Sarker, Arup; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; Fox, Geoffrey |
description | The data engineering and data science community has embraced the idea of
using Python & R dataframes for regular applications. Driven by the big data
revolution and artificial intelligence, these applications are now essential in
order to process terabytes of data. They can easily exceed the capabilities of
a single machine, but also demand significant developer time & effort.
Therefore it is essential to design scalable dataframe solutions. There have
been multiple attempts to tackle this problem, the most notable being the
dataframe systems developed using distributed computing environments such as
Dask and Ray. Even though Dask/Ray distributed computing features look very
promising, we perceive that the Dask Dataframes/Ray Datasets still have room
for optimization. In this paper, we present CylonFlow, an alternative
distributed dataframe execution methodology that enables state-of-the-art
performance and scalability on the same Dask/Ray infrastructure (thereby
supercharging them!). To achieve this, we integrate a high performance
dataframe system Cylon, which was originally based on an entirely different
execution paradigm, into Dask and Ray. Our experiments show that on a pipeline
of dataframe operators, CylonFlow achieves 30x more distributed performance
than Dask Dataframes. Interestingly, it also enables superior sequential
performance due to the native C++ execution of Cylon. We believe the success of
Cylon & CylonFlow extends beyond the data engineering domain, and can be used
to consolidate high performance computing and distributed computing ecosystems. |
doi_str_mv | 10.48550/arxiv.2301.07896 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2301_07896</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2301_07896</sourcerecordid><originalsourceid>FETCH-LOGICAL-a676-3faffabc06bd4d5f68bea10897dbf161b87799bd2892fed854ff34ad07c9b54f3</originalsourceid><addsrcrecordid>eNotj8tOwzAURL1hgQof0BX-gQTn5ccSpS1FqgQSFdvoOrZTS8SJbpwK_p60ZTWa0cxIh5B1xtJSVhV7Bvzx5zQvWJYyIRW_J1-f82ixPQF2PnR046eIXs_RGloP_TjHS7oNZ49D6G2IE90NSPe-O9EPi27AHkJr6QYiLLXlw1pcJg_kzsH3ZB__dUWOu-2x3ieH99e3-uWQABc8KRw4B7plXJvSVI5LbSFjUgmjXcYzLYVQSptcqtxZI6vSuaIEw0Sr9GKKFXm63V7BmhF9D_jbXACbK2DxB0obTbU</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><source>arXiv.org</source><creator>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</creator><creatorcontrib>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</creatorcontrib><description>The data engineering and data science community has embraced the idea of
using Python & R dataframes for regular applications. Driven by the big data
revolution and artificial intelligence, these applications are now essential in
order to process terabytes of data. They can easily exceed the capabilities of
a single machine, but also demand significant developer time & effort.
Therefore it is essential to design scalable dataframe solutions. There have
been multiple attempts to tackle this problem, the most notable being the
dataframe systems developed using distributed computing environments such as
Dask and Ray. Even though Dask/Ray distributed computing features look very
promising, we perceive that the Dask Dataframes/Ray Datasets still have room
for optimization. In this paper, we present CylonFlow, an alternative
distributed dataframe execution methodology that enables state-of-the-art
performance and scalability on the same Dask/Ray infrastructure (thereby
supercharging them!). To achieve this, we integrate a high performance
dataframe system Cylon, which was originally based on an entirely different
execution paradigm, into Dask and Ray. Our experiments show that on a pipeline
of dataframe operators, CylonFlow achieves 30x more distributed performance
than Dask Dataframes. Interestingly, it also enables superior sequential
performance due to the native C++ execution of Cylon. We believe the success of
Cylon & CylonFlow extends beyond the data engineering domain, and can be used
to consolidate high performance computing and distributed computing ecosystems.</description><identifier>DOI: 10.48550/arxiv.2301.07896</identifier><language>eng</language><subject>Computer Science - Databases ; Computer Science - Distributed, Parallel, and Cluster Computing</subject><creationdate>2023-01</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2301.07896$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2301.07896$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Shan, Kaiying</creatorcontrib><creatorcontrib>Kamburugamuwe, Supun</creatorcontrib><creatorcontrib>Kanewela, Thejaka Amila</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Sarker, Arup</creatorcontrib><creatorcontrib>Staylor, Mills</creatorcontrib><creatorcontrib>Zhong, Tianle</creatorcontrib><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><description>The data engineering and data science community has embraced the idea of
using Python & R dataframes for regular applications. Driven by the big data
revolution and artificial intelligence, these applications are now essential in
order to process terabytes of data. They can easily exceed the capabilities of
a single machine, but also demand significant developer time & effort.
Therefore it is essential to design scalable dataframe solutions. There have
been multiple attempts to tackle this problem, the most notable being the
dataframe systems developed using distributed computing environments such as
Dask and Ray. Even though Dask/Ray distributed computing features look very
promising, we perceive that the Dask Dataframes/Ray Datasets still have room
for optimization. In this paper, we present CylonFlow, an alternative
distributed dataframe execution methodology that enables state-of-the-art
performance and scalability on the same Dask/Ray infrastructure (thereby
supercharging them!). To achieve this, we integrate a high performance
dataframe system Cylon, which was originally based on an entirely different
execution paradigm, into Dask and Ray. Our experiments show that on a pipeline
of dataframe operators, CylonFlow achieves 30x more distributed performance
than Dask Dataframes. Interestingly, it also enables superior sequential
performance due to the native C++ execution of Cylon. We believe the success of
Cylon & CylonFlow extends beyond the data engineering domain, and can be used
to consolidate high performance computing and distributed computing ecosystems.</description><subject>Computer Science - Databases</subject><subject>Computer Science - Distributed, Parallel, and Cluster Computing</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8tOwzAURL1hgQof0BX-gQTn5ccSpS1FqgQSFdvoOrZTS8SJbpwK_p60ZTWa0cxIh5B1xtJSVhV7Bvzx5zQvWJYyIRW_J1-f82ixPQF2PnR046eIXs_RGloP_TjHS7oNZ49D6G2IE90NSPe-O9EPi27AHkJr6QYiLLXlw1pcJg_kzsH3ZB__dUWOu-2x3ieH99e3-uWQABc8KRw4B7plXJvSVI5LbSFjUgmjXcYzLYVQSptcqtxZI6vSuaIEw0Sr9GKKFXm63V7BmhF9D_jbXACbK2DxB0obTbU</recordid><startdate>20230119</startdate><enddate>20230119</enddate><creator>Perera, Niranda</creator><creator>Shan, Kaiying</creator><creator>Kamburugamuwe, Supun</creator><creator>Kanewela, Thejaka Amila</creator><creator>Widanage, Chathura</creator><creator>Sarker, Arup</creator><creator>Staylor, Mills</creator><creator>Zhong, Tianle</creator><creator>Abeykoon, Vibhatha</creator><creator>Fox, Geoffrey</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230119</creationdate><title>Supercharging Distributed Computing Environments For High Performance Data Engineering</title><author>Perera, Niranda ; Shan, Kaiying ; Kamburugamuwe, Supun ; Kanewela, Thejaka Amila ; Widanage, Chathura ; Sarker, Arup ; Staylor, Mills ; Zhong, Tianle ; Abeykoon, Vibhatha ; Fox, Geoffrey</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a676-3faffabc06bd4d5f68bea10897dbf161b87799bd2892fed854ff34ad07c9b54f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Databases</topic><topic>Computer Science - Distributed, Parallel, and Cluster Computing</topic><toplevel>online_resources</toplevel><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Shan, 
Kaiying</creatorcontrib><creatorcontrib>Kamburugamuwe, Supun</creatorcontrib><creatorcontrib>Kanewela, Thejaka Amila</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Sarker, Arup</creatorcontrib><creatorcontrib>Staylor, Mills</creatorcontrib><creatorcontrib>Zhong, Tianle</creatorcontrib><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Perera, Niranda</au><au>Shan, Kaiying</au><au>Kamburugamuwe, Supun</au><au>Kanewela, Thejaka Amila</au><au>Widanage, Chathura</au><au>Sarker, Arup</au><au>Staylor, Mills</au><au>Zhong, Tianle</au><au>Abeykoon, Vibhatha</au><au>Fox, Geoffrey</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Supercharging Distributed Computing Environments For High Performance Data Engineering</atitle><date>2023-01-19</date><risdate>2023</risdate><abstract>The data engineering and data science community has embraced the idea of
using Python & R dataframes for regular applications. Driven by the big data
revolution and artificial intelligence, these applications are now essential in
order to process terabytes of data. They can easily exceed the capabilities of
a single machine, but also demand significant developer time & effort.
Therefore it is essential to design scalable dataframe solutions. There have
been multiple attempts to tackle this problem, the most notable being the
dataframe systems developed using distributed computing environments such as
Dask and Ray. Even though Dask/Ray distributed computing features look very
promising, we perceive that the Dask Dataframes/Ray Datasets still have room
for optimization. In this paper, we present CylonFlow, an alternative
distributed dataframe execution methodology that enables state-of-the-art
performance and scalability on the same Dask/Ray infrastructure (thereby
supercharging them!). To achieve this, we integrate a high performance
dataframe system Cylon, which was originally based on an entirely different
execution paradigm, into Dask and Ray. Our experiments show that on a pipeline
of dataframe operators, CylonFlow achieves 30x more distributed performance
than Dask Dataframes. Interestingly, it also enables superior sequential
performance due to the native C++ execution of Cylon. We believe the success of
Cylon & CylonFlow extends beyond the data engineering domain, and can be used
to consolidate high performance computing and distributed computing ecosystems.</abstract><doi>10.48550/arxiv.2301.07896</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2301.07896 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2301_07896 |
source | arXiv.org |
subjects | Computer Science - Databases; Computer Science - Distributed, Parallel, and Cluster Computing |
title | Supercharging Distributed Computing Environments For High Performance Data Engineering |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T18%3A36%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Supercharging%20Distributed%20Computing%20Environments%20For%20High%20Performance%20Data%20Engineering&rft.au=Perera,%20Niranda&rft.date=2023-01-19&rft_id=info:doi/10.48550/arxiv.2301.07896&rft_dat=%3Carxiv_GOX%3E2301_07896%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |