CSViz: Class Separability Visualization for high-dimensional datasets
Published in: | Applied intelligence (Dordrecht, Netherlands), 2024, Vol.54 (1), p.924-946 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Summary: | Data visualization is an essential task throughout the lifecycle of any Data Science (DS) project, particularly during Exploratory Data Analysis (EDA), where it supports correct data preparation and understanding. In classification problems, data visualization is useful for revealing class separability patterns within the dataset. This information is valuable and can later be used when building a Machine Learning (ML) model. High-Dimensional Data (HDD) are one of the biggest challenges in DS. HDD require special treatment because traditional visualization techniques, such as the scatterplot matrix (SPLOM), are limited by space restrictions when dealing with them. Other visualization methods rely on dimensionality reduction techniques, which can lose important information and reduce the interpretability of the data. In this paper, the Class Separability Visualization (CSViz) method is introduced as a new Visual Analytics (VA) approach to address the challenge of visualizing labeled HDD through subspaces. The proposed method gives an overview of class separability through a series of 2-dimensional subspace visualizations of the original variables, each containing an exclusive subset of points, that encompass the most valuable and significant separable patterns. The proposed method is tested on 50 datasets with different characteristics, with promising results. In all cases, more than 90% of the data observations are shown in three plots or fewer. Hence, CSViz significantly eases EDA by reducing the number of plots to be inspected in a SPLOM and thus the time invested in it.
Graphical Abstract
CSViz graphical abstract |
---|---|
ISSN: | 0924-669X, 1573-7497 |
DOI: | 10.1007/s10489-023-05149-4 |