Generating multidimensional clusters with support lines

Bibliographic details
Published in:	Knowledge-based systems 2023-10, Vol.277, p.110836, Article 110836
Main authors:	Fachada, Nuno; de Andrade, Diogo
Format:	Article
Language:	English
Subjects:	Clustering; Data generation; Multidimensional data; Synthetic data
Online access:	Full text

Description
Abstract:	Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem’s space. In turn, synthetic data generators have the potential of creating vast amounts of data (a crucial activity when real-world data is at a premium) while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
ISSN:	0950-7051 1872-7409
DOI:	10.1016/j.knosys.2023.110836
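
Illustration:	The abstract describes clusters "supported by line segments". The sketch below shows one way to read that idea, using only the Python standard library. It is not the Clugen API and does not reproduce the paper's algorithm: the function name `line_supported_cluster`, its parameters, the uniform distribution along the segment, and the Gaussian scatter are all assumptions made for illustration.

```python
import math
import random


def line_supported_cluster(center, direction, length, scatter_std, n_points, rng):
    """Sample points scattered around a line segment in any number of dimensions.

    Hypothetical helper for illustration only; not part of Clugen.
    """
    # Normalize the direction so that `length` is measured in data units.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    points = []
    for _ in range(n_points):
        # Assumption: positions along the segment are uniform on [-length/2, length/2].
        t = rng.uniform(-length / 2, length / 2)
        # Point on the supporting line, plus independent Gaussian noise per coordinate.
        # This noise is isotropic, so it also has a small component along the line.
        points.append(
            [c + t * u + rng.gauss(0.0, scatter_std) for c, u in zip(center, unit)]
        )
    return points


rng = random.Random(42)
cluster = line_supported_cluster(
    center=[0.0, 0.0, 0.0],
    direction=[1.0, 1.0, 0.0],
    length=10.0,
    scatter_std=0.5,
    n_points=200,
    rng=rng,
)
```

Calling the function several times with different centers and directions yields a labeled multi-cluster data set. Swapping `rng.uniform` or `rng.gauss` for another sampler is the kind of "arbitrary distributions" flexibility the abstract mentions, although how Clugen itself exposes that choice is not stated in this record.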