Determining the Largest Overlap between Tables

Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overl...

Detailed description

Bibliographic details
Published in:	Proceedings of the ACM on management of data 2024-03, Vol.2 (1), p.1-26, Article 48
Main authors:	Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix
Format:	Article
Language:	English
Subjects:	Data management systems ; Deduplication ; Information integration ; Information systems
Online access:	Full text

container_end_page	26
container_issue	1
container_start_page	1
container_title	Proceedings of the ACM on management of data
container_volume	2
creator	Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix
description	Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
doi_str_mv	10.1145/3639303
format	Article
fullrecord	<record><control><sourceid>acm_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3639303</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3639303</sourcerecordid><originalsourceid>FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</originalsourceid><addsrcrecordid>eNpNj01LAzEYhINYsNTSu6fcPG1N8mazyVGqVmGhl70v-XhTV3bXkiyK_95Kq3iaGeZhYAhZcbbmXJZ3oMAAgwsyFxpUocoKLv_5K7LM-Y0xJowCbtScrB9wwjR0Yzfu6fSKtLZpj3miuw9MvT1Qh9Mn4kgb63rM12QWbZ9xedYFaZ4em81zUe-2L5v7urBaQqG8KE1AE0rmhEAFurKxciYEgCgC6OgB-DHHGJx12mvluPRSiYpZJiQsyO1p1qf3nBPG9pC6waavlrP252h7Pnokb06k9cMf9Ft-A6kKTEE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Determining the Largest Overlap between Tables</title><source>Access via ACM Digital Library</source><creator>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</creator><creatorcontrib>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</creatorcontrib><description>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. 
The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</description><identifier>ISSN: 2836-6573</identifier><identifier>EISSN: 2836-6573</identifier><identifier>DOI: 10.1145/3639303</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Data management systems ; Deduplication ; Information integration ; Information systems</subject><ispartof>Proceedings of the ACM on management of data, 2024-03, Vol.2 (1), p.1-26, Article 48</ispartof><rights>Owner/Author</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</cites><orcidid>0000-0002-4856-0838 ; 0000-0002-3466-509X ; 0009-0006-9517-7707 ; 0000-0002-4483-1389 ; 0000-0001-8087-6587</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://dl.acm.org/doi/pdf/10.1145/3639303$$EPDF$$P50$$Gacm$$Hfree_for_read</linktopdf><link.rule.ids>314,780,784,2282,27924,27925,40196,76228</link.rule.ids></links><search><creatorcontrib>Zecchini, 
Luca</creatorcontrib><creatorcontrib>Bleifuß, Tobias</creatorcontrib><creatorcontrib>Simonini, Giovanni</creatorcontrib><creatorcontrib>Bergamaschi, Sonia</creatorcontrib><creatorcontrib>Naumann, Felix</creatorcontrib><title>Determining the Largest Overlap between Tables</title><title>Proceedings of the ACM on management of data</title><addtitle>ACM PACMMOD</addtitle><description>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. 
We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</description><subject>Data management systems</subject><subject>Deduplication</subject><subject>Information integration</subject><subject>Information systems</subject><issn>2836-6573</issn><issn>2836-6573</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNpNj01LAzEYhINYsNTSu6fcPG1N8mazyVGqVmGhl70v-XhTV3bXkiyK_95Kq3iaGeZhYAhZcbbmXJZ3oMAAgwsyFxpUocoKLv_5K7LM-Y0xJowCbtScrB9wwjR0Yzfu6fSKtLZpj3miuw9MvT1Qh9Mn4kgb63rM12QWbZ9xedYFaZ4em81zUe-2L5v7urBaQqG8KE1AE0rmhEAFurKxciYEgCgC6OgB-DHHGJx12mvluPRSiYpZJiQsyO1p1qf3nBPG9pC6waavlrP252h7Pnokb06k9cMf9Ft-A6kKTEE</recordid><startdate>20240326</startdate><enddate>20240326</enddate><creator>Zecchini, Luca</creator><creator>Bleifuß, Tobias</creator><creator>Simonini, Giovanni</creator><creator>Bergamaschi, Sonia</creator><creator>Naumann, Felix</creator><general>ACM</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-4856-0838</orcidid><orcidid>https://orcid.org/0000-0002-3466-509X</orcidid><orcidid>https://orcid.org/0009-0006-9517-7707</orcidid><orcidid>https://orcid.org/0000-0002-4483-1389</orcidid><orcidid>https://orcid.org/0000-0001-8087-6587</orcidid></search><sort><creationdate>20240326</creationdate><title>Determining the Largest Overlap between Tables</title><author>Zecchini, Luca ; Bleifuß, Tobias ; Simonini, Giovanni ; Bergamaschi, Sonia ; Naumann, Felix</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a843-6c259de9d50b22e6387af7b9dd33f2d38fc331b9dffdbab8c86b14c46270a0243</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Data management systems</topic><topic>Deduplication</topic><topic>Information integration</topic><topic>Information 
systems</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zecchini, Luca</creatorcontrib><creatorcontrib>Bleifuß, Tobias</creatorcontrib><creatorcontrib>Simonini, Giovanni</creatorcontrib><creatorcontrib>Bergamaschi, Sonia</creatorcontrib><creatorcontrib>Naumann, Felix</creatorcontrib><collection>CrossRef</collection><jtitle>Proceedings of the ACM on management of data</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zecchini, Luca</au><au>Bleifuß, Tobias</au><au>Simonini, Giovanni</au><au>Bergamaschi, Sonia</au><au>Naumann, Felix</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Determining the Largest Overlap between Tables</atitle><jtitle>Proceedings of the ACM on management of data</jtitle><stitle>ACM PACMMOD</stitle><date>2024-03-26</date><risdate>2024</risdate><volume>2</volume><issue>1</issue><spage>1</spage><epage>26</epage><pages>1-26</pages><artnum>48</artnum><issn>2836-6573</issn><eissn>2836-6573</eissn><abstract>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. 
Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/3639303</doi><tpages>26</tpages><orcidid>https://orcid.org/0000-0002-4856-0838</orcidid><orcidid>https://orcid.org/0000-0002-3466-509X</orcidid><orcidid>https://orcid.org/0009-0006-9517-7707</orcidid><orcidid>https://orcid.org/0000-0002-4483-1389</orcidid><orcidid>https://orcid.org/0000-0001-8087-6587</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2836-6573
ispartof	Proceedings of the ACM on management of data, 2024-03, Vol.2 (1), p.1-26, Article 48
issn	2836-6573 2836-6573
language	eng
recordid	cdi_crossref_primary_10_1145_3639303
source	Access via ACM Digital Library
subjects	Data management systems ; Deduplication ; Information integration ; Information systems
title	Determining the Largest Overlap between Tables
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T07%3A44%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Determining%20the%20Largest%20Overlap%20between%20Tables&rft.jtitle=Proceedings%20of%20the%20ACM%20on%20management%20of%20data&rft.au=Zecchini,%20Luca&rft.date=2024-03-26&rft.volume=2&rft.issue=1&rft.spage=1&rft.epage=26&rft.pages=1-26&rft.artnum=48&rft.issn=2836-6573&rft.eissn=2836-6573&rft_id=info:doi/10.1145/3639303&rft_dat=%3Cacm_cross%3E3639303%3C/acm_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true