Clustering Uncertain Data Based on Probability Distribution Similarity

Clustering uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like k-means...

Detailed description

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering 2013-04, Vol.25 (4), p.751-763
Main authors: Jiang, Bin; Pei, Jian; Tao, Yufei; Lin, Xuemin
Format: Article
Language: eng
container_end_page 763
container_issue 4
container_start_page 751
container_title IEEE transactions on knowledge and data engineering
container_volume 25
creator Jiang, Bin
Pei, Jian
Tao, Yufei
Lin, Xuemin
description Clustering uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous or discrete random variable, respectively. We use the well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. However, a naïve implementation is very costly; in particular, computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experimental results verify the effectiveness, efficiency, and scalability of our approaches.
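The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kl_discrete`, `make_kde`, `kl_from_samples`), the bandwidth of 0.5, and the sample size of 300 are assumptions chosen for illustration, and the fast Gauss transform speed-up is not shown. The sketch computes KL divergence exactly in the discrete case, and in the continuous case estimates it by averaging the log ratio of two Gaussian kernel density estimates over the points sampled from the first object.

```python
import math
import random


def kl_discrete(p, q):
    """Exact KL divergence D(p || q) for two discrete distributions (lists of probabilities)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def make_kde(sample, bandwidth):
    """Return a density function: the average of Gaussian kernels centred on the sample points."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


def kl_from_samples(sample_p, sample_q, bandwidth):
    """Estimate D(p || q) as the mean of log(p_hat(x) / q_hat(x)) over the points drawn from p."""
    p_hat = make_kde(sample_p, bandwidth)
    q_hat = make_kde(sample_q, bandwidth)
    return sum(math.log(p_hat(x) / q_hat(x)) for x in sample_p) / len(sample_p)


# Two uncertain objects with the same mean but different variances:
# a distance between their means is zero, yet their distributions differ.
rng = random.Random(0)
narrow = [rng.gauss(0.0, 1.0) for _ in range(300)]  # standard deviation 1
wide = [rng.gauss(0.0, 3.0) for _ in range(300)]    # standard deviation 3

estimate = kl_from_samples(narrow, wide, bandwidth=0.5)

# Closed form for zero-mean Gaussians: ln(s2/s1) + s1^2 / (2 * s2^2) - 1/2
exact = math.log(3.0) + 1.0 / 18.0 - 0.5

print(f"estimated KL: {estimate:.3f}, closed form: {exact:.3f}")
```

Each density evaluation sums over every sample point, so the estimate costs time proportional to the product of the two sample sizes. That quadratic cost is the kind of expense the abstract describes as motivating a faster evaluation technique.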
doi_str_mv 10.1109/TKDE.2011.221
format Article
fullrecord publisher: IEEE; CODEN: ITKEEH; EISSN: 1558-2191; total pages: 13; peer reviewed
fulltext fulltext_linktorsrc
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2013-04, Vol.25 (4), p.751-763
issn 1041-4347
1558-2191
language eng
recordid cdi_ieee_primary_6051435
source IEEE Electronic Library (IEL)
subjects Cameras
Clustering
Clustering algorithms
density estimation
Educational institutions
fast Gauss transform
Kernel
Measurement uncertainty
probabilistic distribution
Probability distribution
Random variables
uncertain data
title Clustering Uncertain Data Based on Probability Distribution Similarity
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T01%3A19%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Clustering%20Uncertain%20Data%20Based%20on%20Probability%20Distribution%20Similarity&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Jiang,%20Bin&rft.date=2013-04-01&rft.volume=25&rft.issue=4&rft.spage=751&rft.epage=763&rft.pages=751-763&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2011.221&rft_dat=%3Ccrossref_RIE%3E10_1109_TKDE_2011_221%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6051435&rfr_iscdi=true