Evolutionary k-means for distributed data sets

One of the challenges in clustering is dealing with data distributed across separate repositories, because most clustering techniques require the data to be centralized. Among these techniques, k-means has been recognized as one of the most influential data mining algorithms for being simple, scalable, and easily adaptable to a variety of contexts and application domains. Although distributed versions of k-means have been proposed, the algorithm remains sensitive to the selection of the initial cluster prototypes and requires the number of clusters to be specified in advance. In this paper, we propose the use of evolutionary algorithms to overcome these limitations of k-means and, at the same time, to deal with distributed data. Two distribution approaches are adopted: the first obtains a final model identical to that of the centralized version of the clustering algorithm; the second generates and selects clusters for each distributed data subset and combines them afterwards. The algorithms are compared from two perspectives: a theoretical one, through asymptotic complexity analyses, and an experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests. The results indicate which variant is more suitable for each application scenario.

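To illustrate the first distribution approach mentioned in the abstract, the following Python sketch (not the authors' algorithm; the function names, the aggregation scheme, and the toy data are assumptions made here for illustration) shows how each repository can report only local sufficient statistics, per-cluster sums and counts, which a coordinator aggregates, so that the resulting centroids match those of a centralized k-means run with the same initialization.

```python
# Minimal sketch of exact distributed k-means via sufficient statistics.
# Assumption: repositories can exchange per-cluster sums/counts with a coordinator.
import numpy as np

def local_statistics(data, centroids):
    """Assign local points to the nearest centroid and return per-cluster
    sums and counts (the only information that leaves the repository)."""
    k = centroids.shape[0]
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        mask = labels == j
        counts[j] = mask.sum()
        if counts[j] > 0:
            sums[j] = data[mask].sum(axis=0)
    return sums, counts

def distributed_kmeans(partitions, initial_centroids, n_iter=20):
    """Coordinator loop: aggregate the local statistics from every partition
    and recompute the global centroids, exactly as centralized k-means would."""
    centroids = initial_centroids.astype(float).copy()
    for _ in range(n_iter):
        total_sums = np.zeros_like(centroids)
        total_counts = np.zeros(centroids.shape[0], dtype=int)
        for part in partitions:              # one pass per repository
            sums, counts = local_statistics(part, centroids)
            total_sums += sums
            total_counts += counts
        nonempty = total_counts > 0
        centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
    return centroids

# Toy usage: two "repositories" holding different parts of the same data set.
rng = np.random.default_rng(0)
parts = [rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (120, 2))]
init = np.vstack([parts[0][0], parts[1][0]])  # naive initialization, k = 2
print(distributed_kmeans(parts, init))
```

In the second approach described in the abstract, by contrast, each repository would cluster its own subset locally and the partial clusterings would be combined afterwards, trading exactness for less communication.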

Bibliographic details
Published in: Neurocomputing (Amsterdam), 2014-03, Vol. 127, pp. 30-42
Authors: Naldi, M.C.; Campello, R.J.G.B.
Format: Article
Language: English
Online access: Full text
DOI: 10.1016/j.neucom.2013.05.046
ISSN: 0925-2312
EISSN: 1872-8286
Publisher: Elsevier B.V., Amsterdam
Subjects:
Algorithms
Applied sciences
Asymptotic properties
Clustering
Clusters
Collection
Computer science; control theory; systems
Data processing. List processing. Character string processing
Dealing
Distributed clustering
Distributed data mining
Evolutionary k-means
Exact sciences and technology
Memory organisation. Data processing
Software
Statistical tests