Evolutionary k-means for distributed data sets

One of the challenges in clustering is dealing with data distributed across separate repositories, because most clustering techniques require the data to be centralized. Among these techniques, k-means has been recognized as one of the most influential data mining algorithms for being simple, scalable, and easily adaptable to a variety of contexts and application domains. Although distributed versions of k-means have been proposed, the algorithm remains sensitive to the selection of the initial cluster prototypes and requires the number of clusters to be specified in advance. In this paper, we propose the use of evolutionary algorithms to overcome these limitations of k-means and, at the same time, to deal with distributed data. Two distribution approaches are adopted: the first obtains a final model identical to that of the centralized version of the clustering algorithm; the second generates and selects clusters for each distributed data subset and combines them afterwards. The algorithms are compared from two perspectives: a theoretical one, through asymptotic complexity analyses, and an experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests. The results indicate which variant is more suitable for each application scenario.

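To illustrate the first distribution approach mentioned in the abstract, the following Python sketch (not the authors' algorithm; the function names, the aggregation scheme, and the toy data are assumptions made here for illustration) shows how each repository can report only local sufficient statistics, per-cluster sums and counts, which a coordinator aggregates, so that the resulting centroids match those of a centralized k-means run with the same initialization.

```python
# Minimal sketch of exact distributed k-means via sufficient statistics.
# Assumption: repositories can exchange per-cluster sums/counts with a coordinator.
import numpy as np

def local_statistics(data, centroids):
    """Assign local points to the nearest centroid and return per-cluster
    sums and counts (the only information that leaves the repository)."""
    k = centroids.shape[0]
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        mask = labels == j
        counts[j] = mask.sum()
        if counts[j] > 0:
            sums[j] = data[mask].sum(axis=0)
    return sums, counts

def distributed_kmeans(partitions, initial_centroids, n_iter=20):
    """Coordinator loop: aggregate the local statistics from every partition
    and recompute the global centroids, exactly as centralized k-means would."""
    centroids = initial_centroids.astype(float).copy()
    for _ in range(n_iter):
        total_sums = np.zeros_like(centroids)
        total_counts = np.zeros(centroids.shape[0], dtype=int)
        for part in partitions:              # one pass per repository
            sums, counts = local_statistics(part, centroids)
            total_sums += sums
            total_counts += counts
        nonempty = total_counts > 0
        centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
    return centroids

# Toy usage: two "repositories" holding different parts of the same data set.
rng = np.random.default_rng(0)
parts = [rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (120, 2))]
init = np.vstack([parts[0][0], parts[1][0]])  # naive initialization, k = 2
print(distributed_kmeans(parts, init))
```

In the second approach described in the abstract, by contrast, each repository would cluster its own subset locally and the partial clusterings would be combined afterwards, trading exactness for less communication.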

Bibliographic details
Published in: Neurocomputing (Amsterdam), 2014-03, Vol. 127, pp. 30-42
Authors: Naldi, M.C.; Campello, R.J.G.B.
Format: Article
Language: English
Online access: Full text
DOI: 10.1016/j.neucom.2013.05.046
ISSN: 0925-2312
EISSN: 1872-8286
Publisher: Elsevier B.V., Amsterdam
Subjects:
Algorithms
Applied sciences
Asymptotic properties
Clustering
Clusters
Collection
Computer science; control theory; systems
Data processing. List processing. Character string processing
Dealing
Distributed clustering
Distributed data mining
Evolutionary k-means
Exact sciences and technology
Memory organisation. Data processing
Software
Statistical tests