Consistent selection of the number of clusters via crossvalidation

In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms,...
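
The key idea in this record's full description (select the number of clusters that minimizes the clustering algorithm's instability, with instability estimated by crossvalidation) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the record: the three-way random split into two training sets and one validation set, the pairwise-disagreement measure of instability, the plain k-means with farthest-first initialization, and all function names (`kmeans`, `instability`, `select_k`, `three_blobs`). It should not be read as the article's exact estimator.

```python
import random
from itertools import combinations


def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute means; an empty cluster keeps its previous centroid.
        updated = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def instability(points, k, splits=20, seed=0):
    """Average disagreement of two clusterings, trained on disjoint samples,
    about which validation pairs belong to the same cluster."""
    rng = random.Random(seed)
    third = len(points) // 3
    total = 0.0
    for _ in range(splits):
        shuffled = list(points)
        rng.shuffle(shuffled)
        train1, train2 = shuffled[:third], shuffled[third:2 * third]
        valid = shuffled[2 * third:]
        c1, c2 = kmeans(train1, k), kmeans(train2, k)
        lab1 = [nearest(p, c1) for p in valid]
        lab2 = [nearest(p, c2) for p in valid]
        pairs = list(combinations(range(len(valid)), 2))
        disagree = sum((lab1[i] == lab1[j]) != (lab2[i] == lab2[j]) for i, j in pairs)
        total += disagree / len(pairs)
    return total / splits


def select_k(points, candidates, **kwargs):
    """Return the candidate k with the smallest estimated instability, plus all scores."""
    scores = {k: instability(points, k, **kwargs) for k in candidates}
    return min(scores, key=scores.get), scores


def three_blobs(per_blob=30, seed=1):
    """Hypothetical toy data: three well-separated square blobs in the plane."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in centers for _ in range(per_blob)]


if __name__ == "__main__":
    best_k, scores = select_k(three_blobs(), candidates=range(2, 6))
    print(best_k, scores)
```

The candidate range starts at 2 because a single cluster is trivially stable under this measure. Any clustering routine that can assign new points to clusters could replace `kmeans` here, which matches the description's claim that the criterion is not tied to distance-based algorithms.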

Full description

Bibliographic details
Published in: Biometrika 2010-12, Vol.97 (4), p.893-904
Main author: WANG, JUNHUI
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 904
container_issue 4
container_start_page 893
container_title Biometrika
container_volume 97
creator WANG, JUNHUI
description In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.
doi_str_mv 10.1093/biomet/asq061
format Article
fullrecord <record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_miscellaneous_1171871959</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>29777144</jstor_id><sourcerecordid>29777144</sourcerecordid><originalsourceid>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</originalsourceid><addsrcrecordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>817786435</pqid></control><display><type>article</type><title>Consistent selection of the number of clusters via crossvalidation</title><source>RePEc</source><source>JSTOR Mathematics &amp; Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><source>Oxford University Press Journals All Titles (1996-Current)</source><source>Alma/SFX Local Collection</source><creator>WANG, JUNHUI</creator><creatorcontrib>WANG, JUNHUI</creatorcontrib><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. 
The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><identifier>ISSN: 0006-3444</identifier><identifier>ISSN: 1464-3510</identifier><identifier>EISSN: 1464-3510</identifier><identifier>EISSN: 0006-3444</identifier><identifier>DOI: 10.1093/biomet/asq061</identifier><identifier>CODEN: BIOKAX</identifier><language>eng</language><publisher>Oxford: Oxford University Press</publisher><subject>Algorithms ; Applications ; Automobiles ; Bee clustering ; Biology, psychology, social sciences ; Cluster analysis ; Clustering ; Crossvalidation ; Datasets ; Estimate reliability ; Estimating techniques ; Exact sciences and technology ; General topics ; k-means ; Mathematics ; Multivariate analysis ; Numerical analysis ; Optimization algorithms ; Parametric inference ; Probability and statistics ; Randomness ; Sciences and techniques of general use ; Selection consistency ; Silhouettes ; Spectral clustering ; Stability ; Statistics ; Studies ; Validity ; Zero</subject><ispartof>Biometrika, 2010-12, Vol.97 (4), p.893-904</ispartof><rights>2010 Biometrika Trust</rights><rights>2015 INIST-CNRS</rights><rights>Copyright Oxford Publishing Limited(England) Dec 
2010</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</citedby><cites>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/29777144$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/29777144$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,4008,27924,27925,58017,58021,58250,58254</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=23651904$$DView record in Pascal Francis$$Hfree_for_read</backlink><backlink>$$Uhttp://econpapers.repec.org/article/oupbiomet/v_3a97_3ay_3a2010_3ai_3a4_3ap_3a893-904.htm$$DView record in RePEc$$Hfree_for_read</backlink></links><search><creatorcontrib>WANG, JUNHUI</creatorcontrib><title>Consistent selection of the number of clusters via crossvalidation</title><title>Biometrika</title><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. 
The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><subject>Algorithms</subject><subject>Applications</subject><subject>Automobiles</subject><subject>Bee clustering</subject><subject>Biology, psychology, social sciences</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Crossvalidation</subject><subject>Datasets</subject><subject>Estimate reliability</subject><subject>Estimating techniques</subject><subject>Exact sciences and technology</subject><subject>General topics</subject><subject>k-means</subject><subject>Mathematics</subject><subject>Multivariate analysis</subject><subject>Numerical analysis</subject><subject>Optimization algorithms</subject><subject>Parametric inference</subject><subject>Probability and statistics</subject><subject>Randomness</subject><subject>Sciences and techniques of general use</subject><subject>Selection consistency</subject><subject>Silhouettes</subject><subject>Spectral 
clustering</subject><subject>Stability</subject><subject>Statistics</subject><subject>Studies</subject><subject>Validity</subject><subject>Zero</subject><issn>0006-3444</issn><issn>1464-3510</issn><issn>1464-3510</issn><issn>0006-3444</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2010</creationdate><recordtype>article</recordtype><sourceid>X2L</sourceid><recordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</recordid><startdate>20101201</startdate><enddate>20101201</enddate><creator>WANG, JUNHUI</creator><general>Oxford University Press</general><general>Biometrika Trust, University College London</general><general>Oxford University Press for Biometrika Trust</general><general>Oxford Publishing Limited (England)</general><scope>BSCLL</scope><scope>IQODW</scope><scope>DKI</scope><scope>X2L</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope></search><sort><creationdate>20101201</creationdate><title>Consistent selection of the number of clusters via crossvalidation</title><author>WANG, JUNHUI</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Algorithms</topic><topic>Applications</topic><topic>Automobiles</topic><topic>Bee clustering</topic><topic>Biology, psychology, social 
sciences</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Crossvalidation</topic><topic>Datasets</topic><topic>Estimate reliability</topic><topic>Estimating techniques</topic><topic>Exact sciences and technology</topic><topic>General topics</topic><topic>k-means</topic><topic>Mathematics</topic><topic>Multivariate analysis</topic><topic>Numerical analysis</topic><topic>Optimization algorithms</topic><topic>Parametric inference</topic><topic>Probability and statistics</topic><topic>Randomness</topic><topic>Sciences and techniques of general use</topic><topic>Selection consistency</topic><topic>Silhouettes</topic><topic>Spectral clustering</topic><topic>Stability</topic><topic>Statistics</topic><topic>Studies</topic><topic>Validity</topic><topic>Zero</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>WANG, JUNHUI</creatorcontrib><collection>Istex</collection><collection>Pascal-Francis</collection><collection>RePEc IDEAS</collection><collection>RePEc</collection><collection>CrossRef</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><jtitle>Biometrika</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>WANG, JUNHUI</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Consistent selection of the number of clusters via crossvalidation</atitle><jtitle>Biometrika</jtitle><date>2010-12-01</date><risdate>2010</risdate><volume>97</volume><issue>4</issue><spage>893</spage><epage>904</epage><pages>893-904</pages><issn>0006-3444</issn><issn>1464-3510</issn><eissn>1464-3510</eissn><eissn>0006-3444</eissn><coden>BIOKAX</coden><abstract>In cluster analysis, one of the major challenges is to estimate the number of clusters. 
Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</abstract><cop>Oxford</cop><pub>Oxford University Press</pub><doi>10.1093/biomet/asq061</doi><tpages>12</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0006-3444
ispartof Biometrika, 2010-12, Vol.97 (4), p.893-904
issn 0006-3444
1464-3510
language eng
recordid cdi_proquest_miscellaneous_1171871959
source RePEc; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing; Oxford University Press Journals All Titles (1996-Current); Alma/SFX Local Collection
subjects Algorithms
Applications
Automobiles
Bee clustering
Biology, psychology, social sciences
Cluster analysis
Clustering
Crossvalidation
Datasets
Estimate reliability
Estimating techniques
Exact sciences and technology
General topics
k-means
Mathematics
Multivariate analysis
Numerical analysis
Optimization algorithms
Parametric inference
Probability and statistics
Randomness
Sciences and techniques of general use
Selection consistency
Silhouettes
Spectral clustering
Stability
Statistics
Studies
Validity
Zero
title Consistent selection of the number of clusters via crossvalidation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T09%3A04%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consistent%20selection%20of%20the%20number%20of%20clusters%20via%20crossvalidation&rft.jtitle=Biometrika&rft.au=WANG,%20JUNHUI&rft.date=2010-12-01&rft.volume=97&rft.issue=4&rft.spage=893&rft.epage=904&rft.pages=893-904&rft.issn=0006-3444&rft.eissn=1464-3510&rft.coden=BIOKAX&rft_id=info:doi/10.1093/biomet/asq061&rft_dat=%3Cjstor_proqu%3E29777144%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=817786435&rft_id=info:pmid/&rft_jstor_id=29777144&rfr_iscdi=true