Robust Query Driven Cardinality Estimation under Changing Workloads

Query driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload dr...

Bibliographic Details
Published in:	Proceedings of the VLDB Endowment 2023-02, Vol.16 (6), p.1520-1533
Main Authors:	Negi, Parimarjan; Wu, Ziniu; Kipf, Andreas; Tatbul, Nesime; Marcus, Ryan; Madden, Sam; Kraska, Tim; Alizadeh, Mohammad
Format:	Article
Language:	eng
Online Access:	Full text

container_end_page	1533
container_issue	6
container_start_page	1520
container_title	Proceedings of the VLDB Endowment
container_volume	16
creator	Negi, Parimarjan; Wu, Ziniu; Kipf, Andreas; Tatbul, Nesime; Marcus, Ryan; Madden, Sam; Kraska, Tim; Alizadeh, Mohammad
description	Query-driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques to make query-driven models robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extend sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates. We show significantly greater generalization than past works across different workloads and databases. For instance, a model trained with our techniques on a simple workload (JOBLight-train), with 40k synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which includes queries with up to 16 tables, and improve query runtimes by 2× over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models never perform worse than PostgreSQL, while standard query-driven models can get much worse than PostgreSQL.
doi_str_mv	10.14778/3583140.3583164
format	Article
fulltext	fulltext
identifier	ISSN: 2150-8097
ispartof	Proceedings of the VLDB Endowment, 2023-02, Vol.16 (6), p.1520-1533
issn	2150-8097 2150-8097
language	eng
recordid	cdi_crossref_primary_10_14778_3583140_3583164
source	ACM Digital Library Complete
title	Robust Query Driven Cardinality Estimation under Changing Workloads
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T15%3A54%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Robust%20Query%20Driven%20Cardinality%20Estimation%20under%20Changing%20Workloads&rft.jtitle=Proceedings%20of%20the%20VLDB%20Endowment&rft.au=Negi,%20Parimarjan&rft.date=2023-02-01&rft.volume=16&rft.issue=6&rft.spage=1520&rft.epage=1533&rft.pages=1520-1533&rft.issn=2150-8097&rft.eissn=2150-8097&rft_id=info:doi/10.14778/3583140.3583164&rft_dat=%3Ccrossref%3E10_14778_3583140_3583164%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
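
Illustrative sketch (not part of the catalog record or the paper's code): the description field says that workload drift is emulated by randomly masking out some table or column features during training, so that the model leans more on features derived from DBMS statistics. The Python snippet below shows one way such masking could look. The feature names, the dictionary layout, the mask value of 0.0, and the function name mask_query_features are all assumptions made for this example; the article's actual featurization may differ.

```python
import random

# Assumed placeholder written into a masked-out feature (hypothetical choice).
MASK_VALUE = 0.0


def mask_query_features(features, maskable, mask_prob, rng):
    """Return a copy of `features` with some query-specific entries masked.

    features : dict mapping a feature name to a float value
    maskable : names of table/column features that may be hidden
    mask_prob: probability of masking each maskable feature independently
    rng      : random.Random instance, passed in for reproducibility
    """
    masked = dict(features)  # copy so the caller's dict is left untouched
    for name in maskable:
        if name in masked and rng.random() < mask_prob:
            masked[name] = MASK_VALUE
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feature vector for one training query.
    query = {
        "table:title": 1.0,
        "table:movie_info": 1.0,
        "col:production_year": 1.0,
        "stat:dbms_estimate": 0.37,  # statistics-based feature, never masked
    }
    maskable = ["table:title", "table:movie_info", "col:production_year"]
    # A fresh mask would be drawn each time the query is seen during training.
    for epoch in range(3):
        print(epoch, mask_query_features(query, maskable, 0.5, rng))
```

Features that are not listed in maskable (here the hypothetical stat:dbms_estimate) always pass through unchanged, which mirrors the abstract's point that the model should still be able to rely on up-to-date DBMS statistics when query-specific information is missing.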