DataVinci: Learning Syntactic and Semantic String Repairs

String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, p...

Bibliographic details
Main authors: Singh, Mukul; Cambronero, José; Gulwani, Sumit; Le, Vu; Negreanu, Carina; Verbruggen, Gust
Format: Article
Language: eng
creator Singh, Mukul
Cambronero, José
Gulwani, Sumit
Le, Vu
Negreanu, Carina
Verbruggen, Gust
description String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused on either syntactic or semantic errors in isolation, ignoring that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors. DataVinci can automatically derive edits for each data error based on the majority patterns and on constraints learned over other columns, without further user interaction. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) the semantic portions of strings before learning majority patterns and deriving edits. Because not all data yields majority patterns, DataVinci leverages execution information from an existing program (which reads the target data) to identify and repair data errors that would not otherwise be detected. DataVinci outperforms 7 baselines on both error detection and repair when evaluated on 4 existing and new benchmarks.
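The majority-pattern idea in the description can be illustrated with a small sketch. This is not DataVinci's implementation: the function names `shape` and `find_outliers`, the coarse tokenization into digit runs, letter runs, and literal characters, and the `min_coverage` threshold are all assumptions made for illustration. The sketch covers detection only and omits repair, the LLM-based abstraction of semantic substrings, and the use of execution information.

```python
import re
from collections import Counter

# Hypothetical tokenization: runs of ASCII digits, runs of ASCII letters,
# or any other single character (kept as a literal).
TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def shape(value: str) -> str:
    """Abstract a string into a coarse regular expression."""
    parts = []
    for match in TOKEN.finditer(value):
        if match.lastgroup == "digits":
            parts.append("[0-9]+")
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(match.group()))
    return "".join(parts)


def find_outliers(values, min_coverage=0.5):
    """Return (majority_pattern, outliers) for one column of strings.

    The most frequent shape counts as the majority pattern only if it
    covers more than `min_coverage` of the values (an assumed threshold).
    Otherwise no pattern is reported and nothing is flagged.
    """
    if not values:
        return None, []
    counts = Counter(shape(v) for v in values)
    pattern, count = counts.most_common(1)[0]
    if count / len(values) <= min_coverage:
        return None, []
    outliers = [v for v in values if re.fullmatch(pattern, v) is None]
    return pattern, outliers


if __name__ == "__main__":
    column = ["AB-123", "CD-456", "EF-789", "GH_012"]
    pattern, outliers = find_outliers(column)
    print(pattern, outliers)
```

In this toy column, three of the four values share the shape "letters, hyphen, digits", so the fourth value is reported as a candidate error. When no shape covers a majority of the column, the sketch reports nothing, which corresponds to the case the description says is handled with execution information instead.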
doi_str_mv 10.48550/arxiv.2308.10922
format Article
creationdate 2023-08-21
rights http://creativecommons.org/licenses/by/4.0
oa free_for_read
linktorsrc https://arxiv.org/abs/2308.10922
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2308.10922
language eng
recordid cdi_arxiv_primary_2308_10922
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Databases
title DataVinci: Learning Syntactic and Semantic String Repairs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T17%3A42%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=DataVinci:%20Learning%20Syntactic%20and%20Semantic%20String%20Repairs&rft.au=Singh,%20Mukul&rft.date=2023-08-21&rft_id=info:doi/10.48550/arxiv.2308.10922&rft_dat=%3Carxiv_GOX%3E2308_10922%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true