Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest
Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2023-10 |
---|---|
1. Verfasser: | |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Md Nishat Raihan |
description | Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mining also need the ground truth for the number of clusters that need to be generated. In this work, we extended the Symbolic Pattern Forest algorithm, another time series clustering algorithm, to determine the optimal number of clusters for the time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric used to calculate the goodness of a clustering technique. Silhouette was calculated on both the bag of word vectors and the tf-idf vectors generated from the SAX words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2871973335</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2871973335</sourcerecordid><originalsourceid>FETCH-proquest_journals_28719733353</originalsourceid><addsrcrecordid>eNqNysEKgkAUQNEhCJLyHx60FnQm09aatKrA9jbGM0ccx-aNRH-fiz6g1V3cs2AeFyIK0h3nK-YTdWEY8n3C41h47J6jQ6vVoIYnuBbhMjqlZQ_nSddowTSQ9RPNhqAxFm5KI5RoFRLk0klCR_BWroXyo2vTqwdcpZv5AIWxSG7Dlo3sCf1f12xbHG_ZKRiteU0zqDoz2WFeFU-T6JAIIWLxn_oC2DVEzg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2871973335</pqid></control><display><type>article</type><title>Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest</title><source>Free E- Journals</source><creator>Md Nishat Raihan</creator><creatorcontrib>Md Nishat Raihan</creatorcontrib><description>Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mining also need the ground truth for the number of clusters that need to be generated. In this work, we extended the Symbolic Pattern Forest algorithm, another time series clustering algorithm, to determine the optimal number of clusters for the time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric used to calculate the goodness of a clustering technique. Silhouette was calculated on both the bag of word vectors and the tf-idf vectors generated from the SAX words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Clustering ; Data mining ; Datasets ; Mathematical analysis ; Time series</subject><ispartof>arXiv.org, 2023-10</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Md Nishat Raihan</creatorcontrib><title>Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest</title><title>arXiv.org</title><description>Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mining also need the ground truth for the number of clusters that need to be generated. In this work, we extended the Symbolic Pattern Forest algorithm, another time series clustering algorithm, to determine the optimal number of clusters for the time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric used to calculate the goodness of a clustering technique. Silhouette was calculated on both the bag of word vectors and the tf-idf vectors generated from the SAX words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.</description><subject>Algorithms</subject><subject>Clustering</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Mathematical analysis</subject><subject>Time series</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNysEKgkAUQNEhCJLyHx60FnQm09aatKrA9jbGM0ccx-aNRH-fiz6g1V3cs2AeFyIK0h3nK-YTdWEY8n3C41h47J6jQ6vVoIYnuBbhMjqlZQ_nSddowTSQ9RPNhqAxFm5KI5RoFRLk0klCR_BWroXyo2vTqwdcpZv5AIWxSG7Dlo3sCf1f12xbHG_ZKRiteU0zqDoz2WFeFU-T6JAIIWLxn_oC2DVEzg</recordid><startdate>20231001</startdate><enddate>20231001</enddate><creator>Md Nishat Raihan</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20231001</creationdate><title>Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest</title><author>Md Nishat Raihan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28719733353</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Clustering</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Mathematical analysis</topic><topic>Time series</topic><toplevel>online_resources</toplevel><creatorcontrib>Md Nishat Raihan</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Md Nishat Raihan</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest</atitle><jtitle>arXiv.org</jtitle><date>2023-10-01</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Clustering algorithms are among the most widely used data mining methods due to their exploratory power and being an initial preprocessing step that paves the way for other techniques. But the problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms like k-means and k-shape in time series data mining also need the ground truth for the number of clusters that need to be generated. In this work, we extended the Symbolic Pattern Forest algorithm, another time series clustering algorithm, to determine the optimal number of clusters for the time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric used to calculate the goodness of a clustering technique. Silhouette was calculated on both the bag of word vectors and the tf-idf vectors generated from the SAX words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2871973335 |
source | Free E- Journals |
subjects | Algorithms Clustering Data mining Datasets Mathematical analysis Time series |
title | Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T13%3A01%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Determining%20the%20Optimal%20Number%20of%20Clusters%20for%20Time%20Series%20Datasets%20with%20Symbolic%20Pattern%20Forest&rft.jtitle=arXiv.org&rft.au=Md%20Nishat%20Raihan&rft.date=2023-10-01&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2871973335%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2871973335&rft_id=info:pmid/&rfr_iscdi=true |