CSViz: Class Separability Visualization for high-dimensional datasets

Bibliographic details
Published in: Applied Intelligence (Dordrecht, Netherlands), 2024, Vol. 54 (1), p. 924-946
Main authors: Cuesta, Marina; Lancho, Carmen; Fernández-Isabel, Alberto; Cano, Emilio L.; Martín De Diego, Isaac
Format: Article
Language: English
Description
Summary: Data visualization is an essential task during the lifecycle of any Data Science (DS) project, particularly during Exploratory Data Analysis (EDA), where it supports correct data preparation and understanding. In classification problems, data visualization is useful for revealing the existence of class separability patterns within the dataset. This information is very valuable and can later be used when building a Machine Learning (ML) model. High-Dimensional Data (HDD) arise as one of the biggest challenges in DS. HDD require special treatment because traditional visualization techniques, such as the scatterplot matrix (SPLOM), are limited by space restrictions when dealing with them. Other visualization methods involve dimensionality reduction techniques, which can lose important information and reduce the interpretability of the data. In this paper, the Class Separability Visualization (CSViz) method is introduced as a new Visual Analytics (VA) approach to address the challenge of visualizing labeled HDD through subspaces. The proposed method gives an overview of class separability by offering a series of 2-dimensional subspace visualizations of the original variables, each containing an exclusive subset of points, that encompass the most valuable and significant separable patterns. The proposed method is tested over 50 datasets with different characteristics, providing promising results. In all cases, more than 90% of the data observations are shown with three plots or fewer. Hence, CSViz significantly eases EDA by reducing the number of plots to be inspected in a SPLOM and thus the amount of time invested in it.
ISSN: 0924-669X, 1573-7497
DOI: 10.1007/s10489-023-05149-4