FilterK: A new outlier detection method for k-means clustering of physical activity
Published in: | Journal of biomedical informatics 2020-04, Vol.104, p.103397-103397, Article 103397 |
---|---|
Main authors: | , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: |
• New outlier detection method for use with k-means and physical activity accelerometer data.
• Comparison with three other outlier detection methods: Local Outlier Factor, Isolation Forests and K-Nearest Neighbours.
• Efficient improvement of average cluster and event purity whilst retaining a high proportion of the original dataset.
In this paper, a new algorithm, FilterK, is proposed for improving the purity of k-means derived physical activity clusters by reducing outlier influence. We applied it to physical activity data obtained with body-worn accelerometers and clustered using k-means. We compared its performance with three existing outlier detection methods, Local Outlier Factor, Isolation Forests and K-Nearest Neighbours (KNN), using ground-truth class labels and average cluster and event purity (ACEP). FilterK provided comparable gains in ACEP (0.581 → 0.596 compared to 0.580–0.617) whilst removing fewer outliers than the other methods (4% of the total dataset versus 10% to achieve this ACEP). The main focus of our new outlier detection method is to improve the cluster purities of physical activity accelerometer data, but we also suggest it may be applicable to other types of dataset clustered with k-means. We demonstrate our method using a k-means model trained on two independent accelerometer datasets (training n = 90) and re-applied to an independent dataset (test n = 41). Labelled physical activities include lying down, sitting, standing, household chores, walking (laboratory and non-laboratory based), stairs and running. This type of clustering algorithm could be used to assist with identifying optimal physical activity patterns for health. |
---|---|
ISSN: | 1532-0464 1532-0480 |
DOI: | 10.1016/j.jbi.2020.103397 |
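
The abstract evaluates outlier removal by the change in average cluster and event purity (ACEP) and by the share of data retained, but this record does not define ACEP. The sketch below is one plausible reading, not the authors' definition: cluster purity is taken as the mean share of each cluster's points that carry its majority label, event purity as the same calculation with the roles of cluster and label swapped, and ACEP as their unweighted mean. The toy data and the flagged indices are hypothetical, and the filtering step stands in for any outlier detector; it is not FilterK.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Mean, over clusters, of the share of points with the cluster's majority label."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, labels):
        by_cluster[c].append(y)
    shares = [Counter(ys).most_common(1)[0][1] / len(ys) for ys in by_cluster.values()]
    return sum(shares) / len(shares)


def event_purity(cluster_ids, labels):
    """ASSUMED definition: cluster purity with clusters and labels swapped."""
    return cluster_purity(labels, cluster_ids)


def acep(cluster_ids, labels):
    """ASSUMED definition: unweighted mean of cluster purity and event purity."""
    return 0.5 * (cluster_purity(cluster_ids, labels) + event_purity(cluster_ids, labels))


# Hypothetical toy data: k-means cluster assignments and ground-truth activities.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
labels = ["sit", "sit", "sit", "walk", "walk", "walk", "run", "run", "run", "sit"]

# Hypothetical indices flagged by some outlier detector.
flagged = {3, 6, 9}
keep = [i for i in range(len(labels)) if i not in flagged]

before = acep(clusters, labels)
after = acep([clusters[i] for i in keep], [labels[i] for i in keep])
retained = len(keep) / len(labels)

print(f"ACEP before: {before:.3f}, after: {after:.3f}, retained: {retained:.0%}")
```

On this toy example the purity rises because every flagged point sat in a cluster dominated by another activity. The trade-off reported in the abstract is the same comparison at scale: how much ACEP improves against how much of the dataset is discarded.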