Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest
Main author: | |
---|---|
Format: | Article |
Language: | English |
Keywords: | |
Online access: | Order full text |
Summary: | Clustering algorithms are among the most widely used data mining methods
because of their exploratory power and their role as an initial preprocessing
step that paves the way for other techniques. However, determining the optimal
number of clusters, k, remains one of the significant challenges for such
methods. The most widely used time series clustering algorithms, such as
k-means and k-shape, also require the ground-truth number of clusters as input.
In this work, we extended Symbolic Pattern Forest (SPF), another time series
clustering algorithm, to determine the optimal number of clusters for time
series datasets. We used SPF to generate clusters from the datasets and chose
the optimal number of clusters based on the Silhouette Coefficient, a metric
that measures the quality of a clustering. The Silhouette Coefficient was
computed on both the bag-of-words vectors and the tf-idf vectors generated from
the SAX words of each time series. We tested our approach on the UCR archive
datasets, and our experimental results so far show a significant improvement
over the baseline. |
DOI: | 10.48550/arxiv.2310.00820 |
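
The selection step described in the summary (score each candidate clustering with the Silhouette Coefficient on bag-of-words and tf-idf vectors, then keep the best k) can be illustrated with a minimal sketch. This is not the authors' implementation: SPF itself is not reproduced here, the SAX words and the candidate cluster labelings are made-up placeholders, the helper names (`bag_of_words`, `tf_idf`, `silhouette_score`, `best_k`) are hypothetical, and the plain `log(N / df)` weighting and Euclidean distance are assumptions chosen for simplicity.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn lists of SAX words into count vectors over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def tf_idf(bow):
    """Weight counts by log(N / df); one common tf-idf variant (an assumption)."""
    n = len(bow)
    dims = len(bow[0])
    df = [sum(1 for row in bow if row[j] > 0) for j in range(dims)]
    return [[row[j] * math.log(n / df[j]) for j in range(dims)] for row in bow]


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, point in enumerate(vectors):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, vectors[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, vectors[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(vectors)


def best_k(vectors, labelings):
    """Return the k whose labeling has the highest silhouette, plus all scores."""
    scores = {k: silhouette_score(vectors, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores


# Hypothetical SAX words for six time series (placeholder data).
sax_words = [
    ["aab", "abb", "aab"],
    ["aab", "abb", "abb"],
    ["aab", "aab", "abb"],
    ["ccd", "cdd", "ccd"],
    ["ccd", "cdd", "cdd"],
    ["cdd", "ccd", "ccd"],
]

# Stand-in for the labelings a clusterer such as SPF would return for each k.
labelings = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}

bow = bag_of_words(sax_words)
tfidf = tf_idf(bow)

k_bow, scores_bow = best_k(bow, labelings)
k_tfidf, scores_tfidf = best_k(tfidf, labelings)
print("best k on bag-of-words vectors:", k_bow)
print("best k on tf-idf vectors:", k_tfidf)
```

In practice the labelings would come from running the clustering algorithm once per candidate k, and a library routine such as scikit-learn's `silhouette_score` could replace the hand-written version.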