Consistent selection of the number of clusters via crossvalidation
In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms,...
Published in: | Biometrika 2010-12, Vol.97 (4), p.893-904 |
---|---|
Main author: | |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 904 |
---|---|
container_issue | 4 |
container_start_page | 893 |
container_title | Biometrika |
container_volume | 97 |
creator | WANG, JUNHUI |
description | In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split. |
doi_str_mv | 10.1093/biomet/asq061 |
format | Article |
fullrecord | <record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_miscellaneous_1171871959</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>29777144</jstor_id><sourcerecordid>29777144</sourcerecordid><originalsourceid>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</originalsourceid><addsrcrecordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>817786435</pqid></control><display><type>article</type><title>Consistent selection of the number of clusters via crossvalidation</title><source>RePEc</source><source>JSTOR Mathematics & Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><source>Oxford University Press Journals All Titles (1996-Current)</source><source>Alma/SFX Local Collection</source><creator>WANG, JUNHUI</creator><creatorcontrib>WANG, JUNHUI</creatorcontrib><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. 
The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><identifier>ISSN: 0006-3444</identifier><identifier>ISSN: 1464-3510</identifier><identifier>EISSN: 1464-3510</identifier><identifier>EISSN: 0006-3444</identifier><identifier>DOI: 10.1093/biomet/asq061</identifier><identifier>CODEN: BIOKAX</identifier><language>eng</language><publisher>Oxford: Oxford University Press</publisher><subject>Algorithms ; Applications ; Automobiles ; Bee clustering ; Biology, psychology, social sciences ; Cluster analysis ; Clustering ; Crossvalidation ; Datasets ; Estimate reliability ; Estimating techniques ; Exact sciences and technology ; General topics ; k-means ; Mathematics ; Multivariate analysis ; Numerical analysis ; Optimization algorithms ; Parametric inference ; Probability and statistics ; Randomness ; Sciences and techniques of general use ; Selection consistency ; Silhouettes ; Spectral clustering ; Stability ; Statistics ; Studies ; Validity ; Zero</subject><ispartof>Biometrika, 2010-12, Vol.97 (4), p.893-904</ispartof><rights>2010 Biometrika Trust</rights><rights>2015 INIST-CNRS</rights><rights>Copyright Oxford Publishing Limited(England) Dec 
2010</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</citedby><cites>FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/29777144$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/29777144$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,4008,27924,27925,58017,58021,58250,58254</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23651904$$DView record in Pascal Francis$$Hfree_for_read</backlink><backlink>$$Uhttp://econpapers.repec.org/article/oupbiomet/v_3a97_3ay_3a2010_3ai_3a4_3ap_3a893-904.htm$$DView record in RePEc$$Hfree_for_read</backlink></links><search><creatorcontrib>WANG, JUNHUI</creatorcontrib><title>Consistent selection of the number of clusters via crossvalidation</title><title>Biometrika</title><description>In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. 
The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</description><subject>Algorithms</subject><subject>Applications</subject><subject>Automobiles</subject><subject>Bee clustering</subject><subject>Biology, psychology, social sciences</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Crossvalidation</subject><subject>Datasets</subject><subject>Estimate reliability</subject><subject>Estimating techniques</subject><subject>Exact sciences and technology</subject><subject>General topics</subject><subject>k-means</subject><subject>Mathematics</subject><subject>Multivariate analysis</subject><subject>Numerical analysis</subject><subject>Optimization algorithms</subject><subject>Parametric inference</subject><subject>Probability and statistics</subject><subject>Randomness</subject><subject>Sciences and techniques of general use</subject><subject>Selection consistency</subject><subject>Silhouettes</subject><subject>Spectral 
clustering</subject><subject>Stability</subject><subject>Statistics</subject><subject>Studies</subject><subject>Validity</subject><subject>Zero</subject><issn>0006-3444</issn><issn>1464-3510</issn><issn>1464-3510</issn><issn>0006-3444</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2010</creationdate><recordtype>article</recordtype><sourceid>X2L</sourceid><recordid>eNpdkdGL1DAQxoMouJ4--igUQfCld0mTJs2j7umdsiKI4uFLSNMJl7Vtekm6eP-9qV32wIcvYZgfHzPfIPSS4HOCJb1onR8gXeh4hzl5hDaEcVbSmuDHaIMx5iVljD1Fz2LcLyWv-Qa93_oxuphgTEWEHkxyfiy8LdItFOM8tBCWyvRzZkIsDk4XJvgYD7p3nV7o5-iJ1X2EF8f_DP34-OH79rrcfb36tH23K01d4VRWtbaUGAxM2Ka11koMhGMhDJNVBbztcsfoToDEtKmkEaLtDO2w4caQCugZerv6TsHfzRCTGlw00Pd6BD9HRYggjSCylhl9_R-693MY83SqIUI0nNE6Q-UK_dsngFVTcIMO94pgtQSq1kDVGmjmP698gAnMCfbzdOQOimop8nOfVeHsQrXLYllTVpMtJWbqNg3Z7M1xQh2N7m3Qo3HxZFpRXpPMZu7Vyu1j8uGhL4UQhLGHJZYb_jn1dfituKCiVtc3vxT-cnP5k3_bKUz_AsRdqoA</recordid><startdate>20101201</startdate><enddate>20101201</enddate><creator>WANG, JUNHUI</creator><general>Oxford University Press</general><general>Biometrika Trust, University College London</general><general>Oxford University Press for Biometrika Trust</general><general>Oxford Publishing Limited (England)</general><scope>BSCLL</scope><scope>IQODW</scope><scope>DKI</scope><scope>X2L</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope></search><sort><creationdate>20101201</creationdate><title>Consistent selection of the number of clusters via crossvalidation</title><author>WANG, JUNHUI</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c520t-25af31c0e47f8bfff90e16077c4922e6bde47cad7e903829c77bdc3d0c6cc12e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Algorithms</topic><topic>Applications</topic><topic>Automobiles</topic><topic>Bee clustering</topic><topic>Biology, psychology, social 
sciences</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Crossvalidation</topic><topic>Datasets</topic><topic>Estimate reliability</topic><topic>Estimating techniques</topic><topic>Exact sciences and technology</topic><topic>General topics</topic><topic>k-means</topic><topic>Mathematics</topic><topic>Multivariate analysis</topic><topic>Numerical analysis</topic><topic>Optimization algorithms</topic><topic>Parametric inference</topic><topic>Probability and statistics</topic><topic>Randomness</topic><topic>Sciences and techniques of general use</topic><topic>Selection consistency</topic><topic>Silhouettes</topic><topic>Spectral clustering</topic><topic>Stability</topic><topic>Statistics</topic><topic>Studies</topic><topic>Validity</topic><topic>Zero</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>WANG, JUNHUI</creatorcontrib><collection>Istex</collection><collection>Pascal-Francis</collection><collection>RePEc IDEAS</collection><collection>RePEc</collection><collection>CrossRef</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><jtitle>Biometrika</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>WANG, JUNHUI</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Consistent selection of the number of clusters via crossvalidation</atitle><jtitle>Biometrika</jtitle><date>2010-12-01</date><risdate>2010</risdate><volume>97</volume><issue>4</issue><spage>893</spage><epage>904</epage><pages>893-904</pages><issn>0006-3444</issn><issn>1464-3510</issn><eissn>1464-3510</eissn><eissn>0006-3444</eissn><coden>BIOKAX</coden><abstract>In cluster analysis, one of the major challenges is to estimate the number of clusters. 
Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split.</abstract><cop>Oxford</cop><pub>Oxford University Press</pub><doi>10.1093/biomet/asq061</doi><tpages>12</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0006-3444 |
ispartof | Biometrika, 2010-12, Vol.97 (4), p.893-904 |
issn | 0006-3444 1464-3510 |
language | eng |
recordid | cdi_proquest_miscellaneous_1171871959 |
source | RePEc; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing; Oxford University Press Journals All Titles (1996-Current); Alma/SFX Local Collection |
subjects | Algorithms; Applications; Automobiles; Bee clustering; Biology, psychology, social sciences; Cluster analysis; Clustering; Crossvalidation; Datasets; Estimate reliability; Estimating techniques; Exact sciences and technology; General topics; k-means; Mathematics; Multivariate analysis; Numerical analysis; Optimization algorithms; Parametric inference; Probability and statistics; Randomness; Sciences and techniques of general use; Selection consistency; Silhouettes; Spectral clustering; Stability; Statistics; Studies; Validity; Zero |
title | Consistent selection of the number of clusters via crossvalidation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T09%3A04%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consistent%20selection%20of%20the%20number%20of%20clusters%20via%20crossvalidation&rft.jtitle=Biometrika&rft.au=WANG,%20JUNHUI&rft.date=2010-12-01&rft.volume=97&rft.issue=4&rft.spage=893&rft.epage=904&rft.pages=893-904&rft.issn=0006-3444&rft.eissn=1464-3510&rft.coden=BIOKAX&rft_id=info:doi/10.1093/biomet/asq061&rft_dat=%3Cjstor_proqu%3E29777144%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=817786435&rft_id=info:pmid/&rft_jstor_id=29777144&rfr_iscdi=true |