Anomaly detection with correlation laws


Bibliographic details
Published in: Data & knowledge engineering, 2023-05, Vol. 145, p. 102181, Article 102181
Main authors: Angiulli, Fabrizio; Fassetti, Fabio; Serrao, Cristina
Format: Article
Language: English
Online access: Full text
Description
Summary: Datasets from different domains usually contain data defined over a wide set of attributes among which various degrees of correlation exist. The identification of data objects not complying with these hidden correlations is a formidable task. Moreover, attributes often play different roles in applications. Specifically, some features can be perceived as independent variables that define a context in which a dependent variable exhibits anomalously behaving values. Hence, in this work we focus on the detection of data objects showing anomalous behavior on a subset of attributes, called behavioral, with respect to some other ones, called contextual. As a main contribution, we design a model to describe the correlation laws hidden in data distributions over pairs of behavioral–contextual attributes. We introduce a probability measure aimed at scoring subsequently observed objects based on how much their behavior deviates from the detected correlation laws. We test our method on both synthetic and real datasets to demonstrate its effectiveness and show its ability to outperform some competitors. Moreover, we discuss a case study in the field of gene expression data analysis to prove that it can provide a valuable contribution when dealing with those scenarios in which the features are much more abundant than the samples available for the analysis.
ISSN: 0169-023X, 1872-6933
DOI: 10.1016/j.datak.2023.102181
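
The summary's idea of scoring an object by how far its behavioral attribute deviates from a correlation law over a contextual attribute can be illustrated with a minimal sketch. This is not the authors' model: as a stand-in it assumes a single linear law fitted by least squares and Gaussian residuals, and the names `fit_line` and `anomaly_score` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y ~ a*x + b; returns (a, b, residual std)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    resid = [y - (a * x + b) for x, y in zip(xs, ys)]
    # Unbiased residual standard deviation (two fitted parameters).
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return a, b, sigma


def anomaly_score(model, x, y):
    """Probability that a residual is smaller in magnitude than the one
    observed, under the ASSUMED Gaussian residual model (higher = more
    anomalous). A stand-in, not the probability measure from the paper."""
    a, b, sigma = model
    z = abs(y - (a * x + b)) / sigma
    return math.erf(z / math.sqrt(2))


# Synthetic training data: contextual attribute x, behavioral attribute y,
# following the (assumed) law y = 2x + 1 plus noise.
rng = random.Random(0)
train_x = [rng.uniform(0, 10) for _ in range(200)]
train_y = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in train_x]
model = fit_line(train_x, train_y)

# Score two subsequently observed objects that share the same context.
normal_score = anomaly_score(model, 5.0, 11.0)   # lies on the law
outlier_score = anomaly_score(model, 5.0, 20.0)  # violates the law
print(f"normal: {normal_score:.3f}  outlier: {outlier_score:.3f}")
```

In this sketch the value y = 20 is unremarkable on its own, since it falls inside the overall range of y in the training data; it scores as anomalous only because it is inconsistent with the context x = 5.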