Estimating the selectivity of tf-idf based cosine similarity predicates

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it...
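
As a rough illustration of the kind of predicate the abstract refers to, the Python sketch below scores strings by tf-idf cosine similarity and computes the exact selectivity of a predicate "similarity to the query is at least a threshold" by brute force. It is not the estimation method from the paper; it only shows the quantity an estimator would try to approximate without scanning every string. The whitespace tokenization, the idf formula log(1 + N/df), the weight given to unseen tokens, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokens; other tokenizers (e.g. q-grams) are possible.
    return s.lower().split()


def build_idf(strings):
    # Document frequency: number of strings in which each token occurs.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumption: idf = log(1 + N/df), which is always positive.
    idf = {t: math.log(1 + n / c) for t, c in df.items()}
    # Assumption: tokens absent from the collection are weighted as if df = 1.
    unseen_idf = math.log(1 + n)
    return idf, unseen_idf


def vectorize(s, idf, unseen_idf):
    # tf-idf weights, normalized to unit length so cosine reduces to a dot product.
    tf = Counter(tokenize(s))
    v = {t: c * idf.get(t, unseen_idf) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm > 0 else {}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(strings, query, threshold):
    # Fraction of strings whose cosine similarity to the query meets the threshold.
    idf, unseen_idf = build_idf(strings)
    q = vectorize(query, idf, unseen_idf)
    matches = sum(
        1 for s in strings
        if cosine(q, vectorize(s, idf, unseen_idf)) >= threshold
    )
    return matches / len(strings)


if __name__ == "__main__":
    collection = ["data cleaning", "data integration", "string matching"]
    print(exact_selectivity(collection, "data cleaning", 0.2))
```

The brute-force scan costs one similarity computation per string, which is why a query optimizer would prefer a cheap estimate over this exact count.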


Bibliographic details
Published in:	SIGMOD record 2007-06, Vol.36 (2), p.7-12
Main authors:	Tata, Sandeep ; Patel, Jignesh M.
Format:	Article
Language:	eng
Online access:	Full text

container_end_page	12
container_issue	2
container_start_page	7
container_title	SIGMOD record
container_volume	36
creator	Tata, Sandeep ; Patel, Jignesh M.
description	An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
doi_str_mv	10.1145/1328854.1328855
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1145_1328854_1328855</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1145_1328854_1328855</sourcerecordid><originalsourceid>FETCH-LOGICAL-c241t-16277af2f081e57e591d95c59b157da27cc1de9d2b539a91d4705be01ee54b73</originalsourceid><addsrcrecordid>eNotj7FqwzAURTW00DTt3FU_4ERP8rPksYQ0LQSyZDey9NSqOHGQRCF_X4d4OsO9XO5h7A3ECqDGNShpDNarO_GBLQQ0qkIjzBN7zvlXCDDQiAXbbXOJJ1vi-ZuXH-KZBnIl_sVy5WPgJVTRB97bTJ67McfzVImnONh0a1wS-ehsofzCHoMdMr3OXLLjx_a4-az2h93X5n1fOVlDqaCRWtsggzBAqAlb8C06bHtA7a3UzoGn1sseVWunsNYCexJAhHWv1ZKt77MujTknCt0lTffTtQPR3dS7WX0mqn-SME5B</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Estimating the selectivity of tf-idf based cosine similarity predicates</title><source>ACM Digital Library Complete</source><creator>Tata, Sandeep ; Patel, Jignesh M.</creator><creatorcontrib>Tata, Sandeep ; Patel, Jignesh M.</creatorcontrib><description>An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. 
We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.</description><identifier>ISSN: 0163-5808</identifier><identifier>DOI: 10.1145/1328854.1328855</identifier><language>eng</language><ispartof>SIGMOD record, 2007-06, Vol.36 (2), p.7-12</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c241t-16277af2f081e57e591d95c59b157da27cc1de9d2b539a91d4705be01ee54b73</citedby><cites>FETCH-LOGICAL-c241t-16277af2f081e57e591d95c59b157da27cc1de9d2b539a91d4705be01ee54b73</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Tata, Sandeep</creatorcontrib><creatorcontrib>Patel, Jignesh M.</creatorcontrib><title>Estimating the selectivity of tf-idf based cosine similarity predicates</title><title>SIGMOD record</title><description>An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. 
We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.</description><issn>0163-5808</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2007</creationdate><recordtype>article</recordtype><recordid>eNotj7FqwzAURTW00DTt3FU_4ERP8rPksYQ0LQSyZDey9NSqOHGQRCF_X4d4OsO9XO5h7A3ECqDGNShpDNarO_GBLQQ0qkIjzBN7zvlXCDDQiAXbbXOJJ1vi-ZuXH-KZBnIl_sVy5WPgJVTRB97bTJ67McfzVImnONh0a1wS-ehsofzCHoMdMr3OXLLjx_a4-az2h93X5n1fOVlDqaCRWtsggzBAqAlb8C06bHtA7a3UzoGn1sseVWunsNYCexJAhHWv1ZKt77MujTknCt0lTffTtQPR3dS7WX0mqn-SME5B</recordid><startdate>200706</startdate><enddate>200706</enddate><creator>Tata, Sandeep</creator><creator>Patel, Jignesh M.</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>200706</creationdate><title>Estimating the selectivity of tf-idf based cosine similarity predicates</title><author>Tata, Sandeep ; Patel, Jignesh M.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c241t-16277af2f081e57e591d95c59b157da27cc1de9d2b539a91d4705be01ee54b73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2007</creationdate><toplevel>online_resources</toplevel><creatorcontrib>Tata, Sandeep</creatorcontrib><creatorcontrib>Patel, Jignesh M.</creatorcontrib><collection>CrossRef</collection><jtitle>SIGMOD record</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tata, Sandeep</au><au>Patel, Jignesh M.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Estimating the selectivity of tf-idf based cosine similarity predicates</atitle><jtitle>SIGMOD record</jtitle><date>2007-06</date><risdate>2007</risdate><volume>36</volume><issue>2</issue><spage>7</spage><epage>12</epage><pages>7-12</pages><issn>0163-5808</issn><abstract>An increasing number of database 
applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.</abstract><doi>10.1145/1328854.1328855</doi><tpages>6</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0163-5808
ispartof	SIGMOD record, 2007-06, Vol.36 (2), p.7-12
issn	0163-5808
language	eng
recordid	cdi_crossref_primary_10_1145_1328854_1328855
source	ACM Digital Library Complete
title	Estimating the selectivity of tf-idf based cosine similarity predicates
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T23%3A24%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Estimating%20the%20selectivity%20of%20tf-idf%20based%20cosine%20similarity%20predicates&rft.jtitle=SIGMOD%20record&rft.au=Tata,%20Sandeep&rft.date=2007-06&rft.volume=36&rft.issue=2&rft.spage=7&rft.epage=12&rft.pages=7-12&rft.issn=0163-5808&rft_id=info:doi/10.1145/1328854.1328855&rft_dat=%3Ccrossref%3E10_1145_1328854_1328855%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true