RumbleML: program the lakehouse with JSONiq

Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces Rumble...

Detailed description

Bibliographic details
Main authors: Fourny, Ghislain; Dao, David; Cikis, Can Berker; Zhang, Ce; Alonso, Gustavo
Format: Article
Language: eng
creator Fourny, Ghislain
Dao, David
Cikis, Can Berker
Zhang, Ce
Alonso, Gustavo
description Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does it using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling on the same platform, RumbleML also brings them to the same language, JSONiq. In the paper, we present the first prototype and compare its performance to Spark showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.
doi_str_mv 10.48550/arxiv.2112.12638
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2112_12638</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2112_12638</sourcerecordid><originalsourceid>FETCH-LOGICAL-a678-d849d93bc37dc4383c2a8f31a140c0b58c07ffbac9745e9f1a7f959729cfaa363</originalsourceid><addsrcrecordid>eNotzr1uwjAUQGEvHSrKA3TCO0pq-9qx3Q2hAkUpSMAe3Tg2iZoIMH_t2yNop7MdfYS8cpZKoxR7w_jTXFLBuUi5yMA8k-Hq3JWt_8rf6T7uthE7eqo9bfHb17vz0dNrc6rpfL1cNIcX8hSwPfr-f3tkM_nYjGdJvpx-jkd5gpk2SWWkrSyUDnTlJBhwAk0Ajlwyx0plHNMhlOislsrbwFEHq6wW1gVEyKBHBn_bh7bYx6bD-Fvc1cVDDTdw5TvC</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>RumbleML: program the lakehouse with JSONiq</title><source>arXiv.org</source><creator>Fourny, Ghislain ; Dao, David ; Cikis, Can Berker ; Zhang, Ce ; Alonso, Gustavo</creator><creatorcontrib>Fourny, Ghislain ; Dao, David ; Cikis, Can Berker ; Zhang, Ce ; Alonso, Gustavo</creatorcontrib><description>Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does it using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). 
The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling on the same platform, RumbleML also brings them to the same language, JSONiq. In the paper, we present the first prototype and compare its performance to Spark showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.</description><identifier>DOI: 10.48550/arxiv.2112.12638</identifier><language>eng</language><subject>Computer Science - Databases</subject><creationdate>2021-12</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2112.12638$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2112.12638$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Fourny, Ghislain</creatorcontrib><creatorcontrib>Dao, David</creatorcontrib><creatorcontrib>Cikis, Can Berker</creatorcontrib><creatorcontrib>Zhang, 
Ce</creatorcontrib><creatorcontrib>Alonso, Gustavo</creatorcontrib><title>RumbleML: program the lakehouse with JSONiq</title><description>Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does it using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling on the same platform, RumbleML also brings them to the same language, JSONiq. 
In the paper, we present the first prototype and compare its performance to Spark showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.</description><subject>Computer Science - Databases</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotzr1uwjAUQGEvHSrKA3TCO0pq-9qx3Q2hAkUpSMAe3Tg2iZoIMH_t2yNop7MdfYS8cpZKoxR7w_jTXFLBuUi5yMA8k-Hq3JWt_8rf6T7uthE7eqo9bfHb17vz0dNrc6rpfL1cNIcX8hSwPfr-f3tkM_nYjGdJvpx-jkd5gpk2SWWkrSyUDnTlJBhwAk0Ajlwyx0plHNMhlOislsrbwFEHq6wW1gVEyKBHBn_bh7bYx6bD-Fvc1cVDDTdw5TvC</recordid><startdate>20211223</startdate><enddate>20211223</enddate><creator>Fourny, Ghislain</creator><creator>Dao, David</creator><creator>Cikis, Can Berker</creator><creator>Zhang, Ce</creator><creator>Alonso, Gustavo</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20211223</creationdate><title>RumbleML: program the lakehouse with JSONiq</title><author>Fourny, Ghislain ; Dao, David ; Cikis, Can Berker ; Zhang, Ce ; Alonso, Gustavo</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a678-d849d93bc37dc4383c2a8f31a140c0b58c07ffbac9745e9f1a7f959729cfaa363</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Computer Science - Databases</topic><toplevel>online_resources</toplevel><creatorcontrib>Fourny, Ghislain</creatorcontrib><creatorcontrib>Dao, David</creatorcontrib><creatorcontrib>Cikis, Can Berker</creatorcontrib><creatorcontrib>Zhang, Ce</creatorcontrib><creatorcontrib>Alonso, Gustavo</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Fourny, Ghislain</au><au>Dao, David</au><au>Cikis, Can Berker</au><au>Zhang, Ce</au><au>Alonso, Gustavo</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>RumbleML: program the lakehouse with JSONiq</atitle><date>2021-12-23</date><risdate>2021</risdate><abstract>Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does it using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling on the same platform, RumbleML also brings them to the same language, JSONiq. 
In the paper, we present the first prototype and compare its performance to Spark showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.</abstract><doi>10.48550/arxiv.2112.12638</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2112.12638
language eng
recordid cdi_arxiv_primary_2112_12638
source arXiv.org
subjects Computer Science - Databases
title RumbleML: program the lakehouse with JSONiq
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T14%3A45%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=RumbleML:%20program%20the%20lakehouse%20with%20JSONiq&rft.au=Fourny,%20Ghislain&rft.date=2021-12-23&rft_id=info:doi/10.48550/arxiv.2112.12638&rft_dat=%3Carxiv_GOX%3E2112_12638%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
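
The description above states the design idea in words: data sets are homogeneous sequences of flat objects, a model is a function from a test set to a prediction-augmented result set, and an estimator is a function from a training set to a model. The following is a minimal sketch of that idea in plain Python. It is not RumbleML's API and not JSONiq; the names `mean_threshold_estimator`, `feature`, `label`, and `prediction` are hypothetical, and the toy threshold rule is only an illustrative stand-in for a real learning algorithm.

```python
from typing import Any, Callable, Dict, List

# A "flat object": one record with scalar fields only (assumption for this sketch).
Row = Dict[str, Any]
# A model maps a test set to the same rows augmented with a prediction field.
Model = Callable[[List[Row]], List[Row]]
# An estimator maps a training set to a model.
Estimator = Callable[[List[Row]], Model]


def mean_threshold_estimator(feature: str, label: str) -> Estimator:
    """Hypothetical toy estimator: learns the midpoint between class means.

    Assumes binary labels 0/1, at least one row per class, and that the
    positive class has the larger mean for the chosen feature.
    """

    def fit(training_set: List[Row]) -> Model:
        positives = [row[feature] for row in training_set if row[label] == 1]
        negatives = [row[feature] for row in training_set if row[label] == 0]
        threshold = (
            sum(positives) / len(positives) + sum(negatives) / len(negatives)
        ) / 2

        def model(test_set: List[Row]) -> List[Row]:
            # Return new rows; the input rows are left unchanged.
            return [
                {**row, "prediction": 1 if row[feature] >= threshold else 0}
                for row in test_set
            ]

        return model

    return fit


training = [
    {"x": 1, "y": 0},
    {"x": 3, "y": 0},
    {"x": 7, "y": 1},
    {"x": 9, "y": 1},
]
estimator = mean_threshold_estimator(feature="x", label="y")
model = estimator(training)           # training set -> model
result = model([{"x": 4}, {"x": 6}])  # test set -> prediction-augmented rows
print(result)
```

Because both estimators and models are ordinary function values here, they can be passed around, stored, and composed like any other data, which is the property the description attributes to function items.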