Also for k-means: more data does not imply better performance

Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show...

Bibliographic details
Published in: Machine learning, 2023-08, Vol. 112 (8), p. 3033-3050
Main authors: Loog, Marco, Krijthe, Jesse H., Bicego, Manuele
Format: Article
Language: eng
Online access: Full text
container_end_page 3050
container_issue 8
container_start_page 3033
container_title Machine learning
container_volume 112
creator Loog, Marco
Krijthe, Jesse H.
Bicego, Manuele
description Arguably, a desirable feature of a learner is that its performance improves with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years, and some curious and surprising findings have been reported. In essence, these results show that more data does not necessarily lead to improved performance; worse, performance can deteriorate. Clustering, however, has not yet been subjected to this kind of study. This paper shows that k-means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. Our main theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k, the question remains open.
doi_str_mv 10.1007/s10994-023-06361-6
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2845340616</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2845340616</sourcerecordid><originalsourceid>FETCH-LOGICAL-c314t-4e6495e91bbad38b95d872b820fd1c953754da12dbd1bd2feecfeb878c14f1673</originalsourceid><addsrcrecordid>eNp9kEtLAzEUhYMoWB9_wFXAdTQ3r8kILkrxBQU3ug7J5I60diY1mS767506gjtXlwPfORc-Qq6A3wDn1W0BXteKcSEZN9IAM0dkBroaozb6mMy4tZoZEPqUnJWy5pwLY82M3M83JdE2ZfrJOvR9uaNdykijHzyNCQvt00BX3XazpwGHATPdYh75zvcNXpCT1m8KXv7ec_L--PC2eGbL16eXxXzJGglqYAqNqjXWEIKP0oZaR1uJYAVvIzS1lpVW0YOIIUKIokVsWgy2sg2oFkwlz8n1tLvN6WuHZXDrtMv9-NIJq7RU3IAZKTFRTU6lZGzdNq86n_cOuDtocpMmN2pyP5rcoSSnUhnh_gPz3_Q_rW8mlWof</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2845340616</pqid></control><display><type>article</type><title>Also for k-means: more data does not imply better performance</title><source>SpringerLink Journals - AutoHoldings</source><creator>Loog, Marco ; Krijthe, Jesse H. ; Bicego, Manuele</creator><creatorcontrib>Loog, Marco ; Krijthe, Jesse H. ; Bicego, Manuele</creatorcontrib><description>Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k -means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. 
Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k , the question remains open.</description><identifier>ISSN: 0885-6125</identifier><identifier>EISSN: 1573-0565</identifier><identifier>DOI: 10.1007/s10994-023-06361-6</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Artificial Intelligence ; Cluster analysis ; Clustering ; Computer Science ; Control ; Data mining ; Hypotheses ; Learning curves ; Machine Learning ; Mechatronics ; Natural Language Processing (NLP) ; Robotics ; Sample size ; Simulation and Modeling ; Special Issue of the ECML PKDD 2023 Journal Track ; Training ; Vector quantization</subject><ispartof>Machine learning, 2023-08, Vol.112 (8), p.3033-3050</ispartof><rights>The Author(s) 2023</rights><rights>The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c314t-4e6495e91bbad38b95d872b820fd1c953754da12dbd1bd2feecfeb878c14f1673</cites><orcidid>0000-0002-1298-8461</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10994-023-06361-6$$EPDF$$P50$$Gspringer$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10994-023-06361-6$$EHTML$$P50$$Gspringer$$Hfree_for_read</linktohtml><link.rule.ids>314,777,781,27905,27906,41469,42538,51300</link.rule.ids></links><search><creatorcontrib>Loog, Marco</creatorcontrib><creatorcontrib>Krijthe, Jesse H.</creatorcontrib><creatorcontrib>Bicego, Manuele</creatorcontrib><title>Also for k-means: more data does not imply better performance</title><title>Machine learning</title><addtitle>Mach Learn</addtitle><description>Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k -means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. 
Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k , the question remains open.</description><subject>Artificial Intelligence</subject><subject>Cluster analysis</subject><subject>Clustering</subject><subject>Computer Science</subject><subject>Control</subject><subject>Data mining</subject><subject>Hypotheses</subject><subject>Learning curves</subject><subject>Machine Learning</subject><subject>Mechatronics</subject><subject>Natural Language Processing (NLP)</subject><subject>Robotics</subject><subject>Sample size</subject><subject>Simulation and Modeling</subject><subject>Special Issue of the ECML PKDD 2023 Journal Track</subject><subject>Training</subject><subject>Vector quantization</subject><issn>0885-6125</issn><issn>1573-0565</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp9kEtLAzEUhYMoWB9_wFXAdTQ3r8kILkrxBQU3ug7J5I60diY1mS767506gjtXlwPfORc-Qq6A3wDn1W0BXteKcSEZN9IAM0dkBroaozb6mMy4tZoZEPqUnJWy5pwLY82M3M83JdE2ZfrJOvR9uaNdykijHzyNCQvt00BX3XazpwGHATPdYh75zvcNXpCT1m8KXv7ec_L--PC2eGbL16eXxXzJGglqYAqNqjXWEIKP0oZaR1uJYAVvIzS1lpVW0YOIIUKIokVsWgy2sg2oFkwlz8n1tLvN6WuHZXDrtMv9-NIJq7RU3IAZKTFRTU6lZGzdNq86n_cOuDtocpMmN2pyP5rcoSSnUhnh_gPz3_Q_rW8mlWof</recordid><startdate>20230801</startdate><enddate>20230801</enddate><creator>Loog, Marco</creator><creator>Krijthe, Jesse H.</creator><creator>Bicego, Manuele</creator><general>Springer US</general><general>Springer Nature 
B.V</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7SC</scope><scope>7XB</scope><scope>88I</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0N</scope><scope>M2P</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0002-1298-8461</orcidid></search><sort><creationdate>20230801</creationdate><title>Also for k-means: more data does not imply better performance</title><author>Loog, Marco ; Krijthe, Jesse H. ; Bicego, Manuele</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c314t-4e6495e91bbad38b95d872b820fd1c953754da12dbd1bd2feecfeb878c14f1673</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial Intelligence</topic><topic>Cluster analysis</topic><topic>Clustering</topic><topic>Computer Science</topic><topic>Control</topic><topic>Data mining</topic><topic>Hypotheses</topic><topic>Learning curves</topic><topic>Machine Learning</topic><topic>Mechatronics</topic><topic>Natural Language Processing (NLP)</topic><topic>Robotics</topic><topic>Sample size</topic><topic>Simulation and Modeling</topic><topic>Special Issue of the ECML PKDD 2023 Journal Track</topic><topic>Training</topic><topic>Vector quantization</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Loog, Marco</creatorcontrib><creatorcontrib>Krijthe, Jesse H.</creatorcontrib><creatorcontrib>Bicego, Manuele</creatorcontrib><collection>Springer Nature OA 
Free Journals</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Science Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Computing Database</collection><collection>Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Machine 
learning</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Loog, Marco</au><au>Krijthe, Jesse H.</au><au>Bicego, Manuele</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Also for k-means: more data does not imply better performance</atitle><jtitle>Machine learning</jtitle><stitle>Mach Learn</stitle><date>2023-08-01</date><risdate>2023</risdate><volume>112</volume><issue>8</issue><spage>3033</spage><epage>3050</epage><pages>3033-3050</pages><issn>0885-6125</issn><eissn>1573-0565</eissn><abstract>Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k -means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k , the question remains open.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10994-023-06361-6</doi><tpages>18</tpages><orcidid>https://orcid.org/0000-0002-1298-8461</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0885-6125
ispartof Machine learning, 2023-08, Vol.112 (8), p.3033-3050
issn 0885-6125
1573-0565
language eng
recordid cdi_proquest_journals_2845340616
source SpringerLink Journals - AutoHoldings
subjects Artificial Intelligence
Cluster analysis
Clustering
Computer Science
Control
Data mining
Hypotheses
Learning curves
Machine Learning
Mechatronics
Natural Language Processing (NLP)
Robotics
Sample size
Simulation and Modeling
Special Issue of the ECML PKDD 2023 Journal Track
Training
Vector quantization
title Also for k-means: more data does not imply better performance
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T02%3A17%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Also%20for%20k-means:%20more%20data%20does%20not%20imply%20better%20performance&rft.jtitle=Machine%20learning&rft.au=Loog,%20Marco&rft.date=2023-08-01&rft.volume=112&rft.issue=8&rft.spage=3033&rft.epage=3050&rft.pages=3033-3050&rft.issn=0885-6125&rft.eissn=1573-0565&rft_id=info:doi/10.1007/s10994-023-06361-6&rft_dat=%3Cproquest_cross%3E2845340616%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2845340616&rft_id=info:pmid/&rfr_iscdi=true
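
Illustrative note on the abstract: the statement that 1-means is monotonic can be made concrete with a small sketch. It assumes that "performance" means the expected squared distance from a fresh point to the learned center, and that the data are i.i.d. with finite variance; under those assumptions the learned center is the sample mean and the expected risk is variance * (1 + 1/n), which decreases in n. The function names `closed_form_risk` and `simulated_risk` are hypothetical, and this is not the authors' proof or code.

```python
import random
import statistics


def closed_form_risk(n: int, variance: float = 1.0) -> float:
    """Expected risk of 1-means trained on n i.i.d. points: variance * (1 + 1/n)."""
    return variance * (1.0 + 1.0 / n)


def simulated_risk(n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity for a 1-D standard normal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # 1-means: the single center is the mean of the training sample.
        center = statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n))
        # For N(0, 1), the risk of a fixed center c is E[(X - c)^2] = 1 + c^2.
        total += 1.0 + center * center
    return total / trials


if __name__ == "__main__":
    # Print training set size, closed-form risk, and simulated risk.
    for n in (1, 2, 5, 10, 50):
        print(n, round(closed_form_risk(n), 4), round(simulated_risk(n), 4))
```

The sketch covers only the k = 1 case; the non-monotonic behavior the abstract reports for 2-means depends on constructions given in the paper and is not reproduced here.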