Also for k-means: more data does not imply better performance
Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k-means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k, the question remains open.
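The abstract's central notion is the learning curve of a clusterer: the expected performance of k-means as a function of the training set size, which the paper shows need not improve monotonically. As a rough illustration of what such a curve measures (not code from the paper), the sketch below estimates the expected quantization error of 2-means on a simple synthetic mixture for several training set sizes; the distribution, sample sizes, and repetition counts are arbitrary assumptions, and NumPy and scikit-learn are assumed to be available.

```python
# Illustrative sketch only: empirically estimate a k-means learning curve,
# i.e., the expected quantization error as a function of training set size n.
# All distributional and parameter choices below are assumptions for
# illustration and are not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample(n):
    # Simple two-component Gaussian mixture on the real line (assumed example).
    comp = rng.integers(0, 2, size=n)
    x = np.where(comp == 0, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
    return x.reshape(-1, 1)

def quantization_error(centers, x):
    # Mean squared distance of each point to its nearest cluster center.
    d2 = (x - centers.reshape(1, -1)) ** 2
    return d2.min(axis=1).mean()

# Large held-out sample approximating the error under the true distribution.
test = sample(100_000)

for n in [5, 10, 20, 50, 100, 500]:
    errs = []
    for _ in range(100):  # average over independent training samples of size n
        train = sample(n)
        km = KMeans(n_clusters=2, n_init=10).fit(train)
        errs.append(quantization_error(km.cluster_centers_.ravel(), test))
    print(f"n = {n:4d}: estimated expected quantization error = {np.mean(errs):.4f}")
```

On a well-separated mixture like this one the estimated curve will typically decrease with n; the paper's point is that there exist distributions for which the expected 2-means performance does not improve monotonically, no matter how large the training sample becomes.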
Published in: | Machine learning, 2023-08, Vol. 112 (8), p. 3033-3050 |
---|---|
Main authors: | Loog, Marco; Krijthe, Jesse H.; Bicego, Manuele |
Format: | Article |
Language: | English |
Online access: | Full text |
container_end_page | 3050 |
---|---|
container_issue | 8 |
container_start_page | 3033 |
container_title | Machine learning |
container_volume | 112 |
creator | Loog, Marco; Krijthe, Jesse H.; Bicego, Manuele |
description | Arguably, a desirable feature of a learner is that its performance gets better with an increasing amount of training data, at least in expectation. This issue has received renewed attention in recent years and some curious and surprising findings have been reported on. In essence, these results show that more data does actually not necessarily lead to improved performance—worse even, performance can deteriorate. Clustering, however, has not been subjected to such kind of study up to now. This paper shows that k-means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display deterioration in expected performance with increasing training set sizes. Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k, the question remains open. |
doi_str_mv | 10.1007/s10994-023-06361-6 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0885-6125 |
ispartof | Machine learning, 2023-08, Vol.112 (8), p.3033-3050 |
issn | 0885-6125; 1573-0565 |
language | eng |
recordid | cdi_proquest_journals_2845340616 |
source | SpringerNature Journals |
subjects | Artificial Intelligence; Cluster analysis; Clustering; Computer Science; Control; Data mining; Hypotheses; Learning curves; Machine Learning; Mechatronics; Natural Language Processing (NLP); Robotics; Sample size; Simulation and Modeling; Special Issue of the ECML PKDD 2023 Journal Track; Training; Vector quantization |
title | Also for k-means: more data does not imply better performance |