Scalable Hierarchical Agglomerative Clustering

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomera...

Bibliographic details
Published in:	arXiv.org 2021-09
Main authors:	Monath, Nicholas; Dubey, Avinava; Guru Guruganesh; Manzil Zaheer; Ahmed, Amr; McCallum, Andrew; Mergen, Gokhan; Najork, Marc; Terzihan, Mert; Tjanaka, Bryon; Wang, Yuan; Wu, Yuchen
Format:	Article
Language:	eng
Subjects:	Agglomeration; Algorithms; Approximation; Cluster analysis; Clustering; Computer Science - Learning; Datasets; Hierarchies; Massive data points
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Monath, Nicholas; Dubey, Avinava; Guru Guruganesh; Manzil Zaheer; Ahmed, Amr; McCallum, Andrew; Mergen, Gokhan; Najork, Marc; Terzihan, Mert; Tjanaka, Bryon; Wang, Yuan; Wu, Yuchen
description	The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state of the art.
doi_str_mv	10.48550/arxiv.2010.11821
format	Article
fullrecord	Record TN_cdi_arxiv_primary_2010_11821 (source: proquest_arxiv; source record ID: 2453835197; source type: Open Access Repository; record type: article)
publisher	Ithaca: Cornell University Library, arXiv.org
creationdate	2021-09-30
rights	2021. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
oa	free_for_read
backlink	https://doi.org/10.1145/3447548.3467404 (View published paper; access to full text may be restricted)
backlink	https://doi.org/10.48550/arXiv.2010.11821 (View paper in arXiv)
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2021-09
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2010_11821
source	arXiv.org; Free E- Journals
subjects	Agglomeration; Algorithms; Approximation; Cluster analysis; Clustering; Computer Science - Learning; Datasets; Hierarchies; Massive data points
title	Scalable Hierarchical Agglomerative Clustering
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T10%3A55%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20Hierarchical%20Agglomerative%20Clustering&rft.jtitle=arXiv.org&rft.au=Monath,%20Nicholas&rft.date=2021-09-30&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2010.11821&rft_dat=%3Cproquest_arxiv%3E2453835197%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2453835197&rft_id=info:pmid/&rfr_iscdi=true