Determining the Largest Overlap between Tables

Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overl...

Bibliographic Details
Published in: Proceedings of the ACM on management of data, 2024-03, Vol. 2 (1), p. 1-26, Article 48
Main authors: Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix
Format: Article
Language: English
Subjects: Data management systems; Deduplication; Information integration; Information systems
Online access: Full text
container_end_page 26
container_issue 1
container_start_page 1
container_title Proceedings of the ACM on management of data
container_volume 2
creator Zecchini, Luca
Bleifuß, Tobias
Simonini, Giovanni
Bergamaschi, Sonia
Naumann, Felix
description Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
doi_str_mv 10.1145/3639303
format Article
fullrecord <record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3639303</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3639303</sourcerecordid><originalsourceid>FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</originalsourceid><addsrcrecordid>eNpNj01LAzEYhINYsNTSu6fcPG1N8mazyVGqVmGhl70v-XhTV3bXkiyK_95Kq3iaGeZhYAhZcbbmXJZ3oMAAgwsyFxpUocoKLv_5K7LM-Y0xJowCbtScrB9wwjR0Yzfu6fSKtLZpj3miuw9MvT1Qh9Mn4kgb63rM12QWbZ9xedYFaZ4em81zUe-2L5v7urBaQqG8KE1AE0rmhEAFurKxciYEgCgC6OgB-DHHGJx12mvluPRSiYpZJiQsyO1p1qf3nBPG9pC6waavlrP252h7Pnokb06k9cMf9Ft-A6kKTEE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Determining the Largest Overlap between Tables</title><source>Access via ACM Digital Library</source><creator>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</creator><creatorcontrib>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</creatorcontrib><description>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. 
The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</description><identifier>ISSN: 2836-6573</identifier><identifier>EISSN: 2836-6573</identifier><identifier>DOI: 10.1145/3639303</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Data management systems ; Deduplication ; Information integration ; Information systems</subject><ispartof>Proceedings of the ACM on management of data, 2024-03, Vol.2 (1), p.1-26, Article 48</ispartof><rights>Owner/Author</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</cites><orcidid>0000-0002-4856-0838 ; 0000-0002-3466-509X ; 0009-0006-9517-7707 ; 0000-0002-4483-1389 ; 0000-0001-8087-6587</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3639303$$EPDF$$P50$$Gacm$$Hfree_for_read</linktopdf><link.rule.ids>314,780,784,2282,27924,27925,40196,76228</link.rule.ids></links><search><creatorcontrib>Zecchini, 
Luca</creatorcontrib><creatorcontrib>Bleifuß, Tobias</creatorcontrib><creatorcontrib>Simonini, Giovanni</creatorcontrib><creatorcontrib>Bergamaschi, Sonia</creatorcontrib><creatorcontrib>Naumann, Felix</creatorcontrib><title>Determining the Largest Overlap between Tables</title><title>Proceedings of the ACM on management of data</title><addtitle>ACM PACMMOD</addtitle><description>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. 
We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</description><subject>Data management systems</subject><subject>Deduplication</subject><subject>Information integration</subject><subject>Information systems</subject><issn>2836-6573</issn><issn>2836-6573</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNpNj01LAzEYhINYsNTSu6fcPG1N8mazyVGqVmGhl70v-XhTV3bXkiyK_95Kq3iaGeZhYAhZcbbmXJZ3oMAAgwsyFxpUocoKLv_5K7LM-Y0xJowCbtScrB9wwjR0Yzfu6fSKtLZpj3miuw9MvT1Qh9Mn4kgb63rM12QWbZ9xedYFaZ4em81zUe-2L5v7urBaQqG8KE1AE0rmhEAFurKxciYEgCgC6OgB-DHHGJx12mvluPRSiYpZJiQsyO1p1qf3nBPG9pC6waavlrP252h7Pnokb06k9cMf9Ft-A6kKTEE</recordid><startdate>20240326</startdate><enddate>20240326</enddate><creator>Zecchini, Luca</creator><creator>Bleifuß, Tobias</creator><creator>Simonini, Giovanni</creator><creator>Bergamaschi, Sonia</creator><creator>Naumann, Felix</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-4856-0838</orcidid><orcidid>https://orcid.org/0000-0002-3466-509X</orcidid><orcidid>https://orcid.org/0009-0006-9517-7707</orcidid><orcidid>https://orcid.org/0000-0002-4483-1389</orcidid><orcidid>https://orcid.org/0000-0001-8087-6587</orcidid></search><sort><creationdate>20240326</creationdate><title>Determining the Largest Overlap between Tables</title><author>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Data management systems</topic><topic>Deduplication</topic><topic>Information integration</topic><topic>Information 
systems</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zecchini, Luca</creatorcontrib><creatorcontrib>Bleifuß, Tobias</creatorcontrib><creatorcontrib>Simonini, Giovanni</creatorcontrib><creatorcontrib>Bergamaschi, Sonia</creatorcontrib><creatorcontrib>Naumann, Felix</creatorcontrib><collection>CrossRef</collection><jtitle>Proceedings of the ACM on management of data</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zecchini, Luca</au><au>Bleifuß, Tobias</au><au>Simonini, Giovanni</au><au>Bergamaschi, Sonia</au><au>Naumann, Felix</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Determining the Largest Overlap between Tables</atitle><jtitle>Proceedings of the ACM on management of data</jtitle><stitle>ACM PACMMOD</stitle><date>2024-03-26</date><risdate>2024</risdate><volume>2</volume><issue>1</issue><spage>1</spage><epage>26</epage><pages>1-26</pages><artnum>48</artnum><issn>2836-6573</issn><eissn>2836-6573</eissn><abstract>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. 
Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/3639303</doi><tpages>26</tpages><orcidid>https://orcid.org/0000-0002-4856-0838</orcidid><orcidid>https://orcid.org/0000-0002-3466-509X</orcidid><orcidid>https://orcid.org/0009-0006-9517-7707</orcidid><orcidid>https://orcid.org/0000-0002-4483-1389</orcidid><orcidid>https://orcid.org/0000-0001-8087-6587</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2836-6573
ispartof Proceedings of the ACM on management of data, 2024-03, Vol.2 (1), p.1-26, Article 48
issn 2836-6573
2836-6573
language eng
recordid cdi_crossref_primary_10_1145_3639303
source Access via ACM Digital Library
subjects Data management systems
Deduplication
Information integration
Information systems
title Determining the Largest Overlap between Tables
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T07%3A44%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Determining%20the%20Largest%20Overlap%20between%20Tables&rft.jtitle=Proceedings%20of%20the%20ACM%20on%20management%20of%20data&rft.au=Zecchini,%20Luca&rft.date=2024-03-26&rft.volume=2&rft.issue=1&rft.spage=1&rft.epage=26&rft.pages=1-26&rft.artnum=48&rft.issn=2836-6573&rft.eissn=2836-6573&rft_id=info:doi/10.1145/3639303&rft_dat=%3Cacm_cross%3E3639303%3C/acm_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
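
Illustration (not part of the catalog record): the description above characterizes the largest overlap as the largest common subtable, which may only become visible after permuting columns and rows. The sketch below is a naive brute-force search that makes this idea concrete for very small tables. It is not Sloth, whose algorithm is not described in this record. The function name `largest_overlap`, the representation of tables as lists of tuples, the sample data, and the choice to measure overlap size as rows times columns with duplicate rows counted as a multiset are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap(a, b):
    """Brute-force search for the largest common subtable of two small tables.

    Tables are lists of equal-length tuples. Returns (area, mapping, rows):
    area is rows * columns (an assumed size measure), mapping pairs column
    indexes of `a` with column indexes of `b`, and rows are the shared
    projected rows. Runtime is exponential in the number of columns.
    """
    n_a = len(a[0]) if a else 0
    n_b = len(b[0]) if b else 0
    best = (0, (), [])
    for k in range(1, min(n_a, n_b) + 1):
        # Pick k columns of `a`; row order is irrelevant, so use a multiset.
        for cols_a in combinations(range(n_a), k):
            proj_a = Counter(tuple(row[c] for c in cols_a) for row in a)
            # Try every ordered choice of k columns of `b` (column permutation).
            for cols_b in permutations(range(n_b), k):
                proj_b = Counter(tuple(row[c] for c in cols_b) for row in b)
                common = proj_a & proj_b  # multiset intersection of rows
                area = k * sum(common.values())
                if area > best[0]:
                    mapping = tuple(zip(cols_a, cols_b))
                    best = (area, mapping, list(common.elements()))
    return best


# Hypothetical sample tables: same kind of data, different column and row order.
table_a = [("Rome", "Italy", 2.8), ("Berlin", "Germany", 3.6), ("Paris", "France", 2.1)]
table_b = [("Germany", 3.7, "Berlin"), ("Italy", 2.8, "Rome"), ("Spain", 3.2, "Madrid")]

area, mapping, rows = largest_overlap(table_a, table_b)
print(area, mapping, sorted(rows))
```

On the sample tables, mapping all three columns yields only one shared row (area 3), while mapping just the city and country columns yields two shared rows (area 4). The wider mapping is therefore not the largest overlap, which gives a sense of why the search space is hard to prune and why the description calls the problem computationally challenging.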