Consistent selection of the number of clusters via crossvalidation

In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms,...

Bibliographic details
Published in:	Biometrika 2010-12, Vol.97 (4), p.893-904
Main author:	WANG, JUNHUI
Format:	Article
Language:	eng
Subjects:	Algorithms; Applications; Automobiles; Bee clustering; Biology, psychology, social sciences; Cluster analysis; Clustering; Crossvalidation; Datasets; Estimate reliability; Estimating techniques; Exact sciences and technology; General topics; k-means; Mathematics; Multivariate analysis; Numerical analysis; Optimization algorithms; Parametric inference; Probability and statistics; Randomness; Sciences and techniques of general use; Selection consistency; Silhouettes; Spectral clustering; Stability; Statistics; Studies; Validity; Zero
Online access:	Full text

container_end_page	904
container_issue	4
container_start_page	893
container_title	Biometrika
container_volume	97
creator	WANG, JUNHUI
description	In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.
doi_str_mv	10.1093/biomet/asq061
format	Article
fullrecord	<record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_miscellaneous_1171871959</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>29777144</jstor_id><sourcerecordid>29777144</sourcerecordid><originalsourceid>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</originalsourceid><addsrcrecordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>817786435</pqid></control><display><type>article</type><title>Consistent selection of the number of clusters via crossvalidation</title><source>RePEc</source><source>JSTOR Mathematics & Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><source>Oxford University Press Journals All Titles (1996-Current)</source><source>Alma/SFX Local Collection</source><creator>WANG, JUNHUI</creator><creatorcontrib>WANG, JUNHUI</creatorcontrib><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. 
The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><identifier>ISSN: 0006-3444</identifier><identifier>ISSN: 1464-3510</identifier><identifier>EISSN: 1464-3510</identifier><identifier>EISSN: 0006-3444</identifier><identifier>DOI: 10.1093/biomet/asq061</identifier><identifier>CODEN: BIOKAX</identifier><language>eng</language><publisher>Oxford: Oxford University Press</publisher><subject>Algorithms ; Applications ; Automobiles ; Bee clustering ; Biology, psychology, social sciences ; Cluster analysis ; Clustering ; Crossvalidation ; Datasets ; Estimate reliability ; Estimating techniques ; Exact sciences and technology ; General topics ; k-means ; Mathematics ; Multivariate analysis ; Numerical analysis ; Optimization algorithms ; Parametric inference ; Probability and statistics ; Randomness ; Sciences and techniques of general use ; Selection consistency ; Silhouettes ; Spectral clustering ; Stability ; Statistics ; Studies ; Validity ; Zero</subject><ispartof>Biometrika, 2010-12, Vol.97 (4), p.893-904</ispartof><rights>2010 Biometrika Trust</rights><rights>2015 INIST-CNRS</rights><rights>Copyright Oxford Publishing Limited(England) Dec 
2010</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</citedby><cites>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/29777144$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/29777144$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,4008,27924,27925,58017,58021,58250,58254</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23651904$$DView record in Pascal Francis$$Hfree_for_read</backlink><backlink>$$Uhttp://econpapers.repec.org/article/oupbiomet/v_3a97_3ay_3a2010_3ai_3a4_3ap_3a893-904.htm$$DView record in RePEc$$Hfree_for_read</backlink></links><search><creatorcontrib>WANG, JUNHUI</creatorcontrib><title>Consistent selection of the number of clusters via crossvalidation</title><title>Biometrika</title><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. 
The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><subject>Algorithms</subject><subject>Applications</subject><subject>Automobiles</subject><subject>Bee clustering</subject><subject>Biology, psychology, social sciences</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Crossvalidation</subject><subject>Datasets</subject><subject>Estimate reliability</subject><subject>Estimating techniques</subject><subject>Exact sciences and technology</subject><subject>General topics</subject><subject>k-means</subject><subject>Mathematics</subject><subject>Multivariate analysis</subject><subject>Numerical analysis</subject><subject>Optimization algorithms</subject><subject>Parametric inference</subject><subject>Probability and statistics</subject><subject>Randomness</subject><subject>Sciences and techniques of general use</subject><subject>Selection consistency</subject><subject>Silhouettes</subject><subject>Spectral 
clustering</subject><subject>Stability</subject><subject>Statistics</subject><subject>Studies</subject><subject>Validity</subject><subject>Zero</subject><issn>0006-3444</issn><issn>1464-3510</issn><issn>1464-3510</issn><issn>0006-3444</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2010</creationdate><recordtype>article</recordtype><sourceid>X2L</sourceid><recordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</recordid><startdate>20101201</startdate><enddate>20101201</enddate><creator>WANG, JUNHUI</creator><general>Oxford University Press</general><general>Biometrika Trust, University College London</general><general>Oxford University Press for Biometrika Trust</general><general>Oxford Publishing Limited (England)</general><scope>BSCLL</scope><scope>IQODW</scope><scope>DKI</scope><scope>X2L</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope></search><sort><creationdate>20101201</creationdate><title>Consistent selection of the number of clusters via crossvalidation</title><author>WANG, JUNHUI</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Algorithms</topic><topic>Applications</topic><topic>Automobiles</topic><topic>Bee clustering</topic><topic>Biology, psychology, social 
sciences</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Crossvalidation</topic><topic>Datasets</topic><topic>Estimate reliability</topic><topic>Estimating techniques</topic><topic>Exact sciences and technology</topic><topic>General topics</topic><topic>k-means</topic><topic>Mathematics</topic><topic>Multivariate analysis</topic><topic>Numerical analysis</topic><topic>Optimization algorithms</topic><topic>Parametric inference</topic><topic>Probability and statistics</topic><topic>Randomness</topic><topic>Sciences and techniques of general use</topic><topic>Selection consistency</topic><topic>Silhouettes</topic><topic>Spectral clustering</topic><topic>Stability</topic><topic>Statistics</topic><topic>Studies</topic><topic>Validity</topic><topic>Zero</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>WANG, JUNHUI</creatorcontrib><collection>Istex</collection><collection>Pascal-Francis</collection><collection>RePEc IDEAS</collection><collection>RePEc</collection><collection>CrossRef</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><jtitle>Biometrika</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>WANG, JUNHUI</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Consistent selection of the number of clusters via crossvalidation</atitle><jtitle>Biometrika</jtitle><date>2010-12-01</date><risdate>2010</risdate><volume>97</volume><issue>4</issue><spage>893</spage><epage>904</epage><pages>893-904</pages><issn>0006-3444</issn><issn>1464-3510</issn><eissn>1464-3510</eissn><eissn>0006-3444</eissn><coden>BIOKAX</coden><abstract>In cluster analysis, one of the major challenges is to estimate the number of clusters. 
Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</abstract><cop>Oxford</cop><pub>Oxford University Press</pub><doi>10.1093/biomet/asq061</doi><tpages>12</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0006-3444
ispartof	Biometrika, 2010-12, Vol.97 (4), p.893-904
issn	0006-3444 1464-3510 1464-3510 0006-3444
language	eng
recordid	cdi_proquest_miscellaneous_1171871959
source	RePEc; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing; Oxford University Press Journals All Titles (1996-Current); Alma/SFX Local Collection
subjects	Algorithms; Applications; Automobiles; Bee clustering; Biology, psychology, social sciences; Cluster analysis; Clustering; Crossvalidation; Datasets; Estimate reliability; Estimating techniques; Exact sciences and technology; General topics; k-means; Mathematics; Multivariate analysis; Numerical analysis; Optimization algorithms; Parametric inference; Probability and statistics; Randomness; Sciences and techniques of general use; Selection consistency; Silhouettes; Spectral clustering; Stability; Statistics; Studies; Validity; Zero
title	Consistent selection of the number of clusters via crossvalidation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T09%3A04%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consistent%20selection%20of%20the%20number%20of%20clusters%20via%20crossvalidation&rft.jtitle=Biometrika&rft.au=WANG,%20JUNHUI&rft.date=2010-12-01&rft.volume=97&rft.issue=4&rft.spage=893&rft.epage=904&rft.pages=893-904&rft.issn=0006-3444&rft.eissn=1464-3510&rft.coden=BIOKAX&rft_id=info:doi/10.1093/biomet/asq061&rft_dat=%3Cjstor_proqu%3E29777144%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=817786435&rft_id=info:pmid/&rft_jstor_id=29777144&rfr_iscdi=true