Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection

Bibliographic Details
Published in:	IEEE Transactions on Knowledge and Data Engineering, 2015-05, Vol. 27 (5), p. 1369-1382
Main authors:	Radovanovic, Milos; Nanopoulos, Alexandros; Ivanovic, Mirjana
Format:	Article
Language:	English
Subjects:	Context; Correlation; Counting; Data analysis; distance concentration; Educational institutions; Estimating techniques; Euclidean distance; high-dimensional data; Histograms; Labels; Lists; Methods; Noise measurement; Outlier detection; Outliers (statistics); Production methods; reverse nearest neighbors

Description
Summary:	Outlier detection in high-dimensional data presents various challenges resulting from the "curse of dimensionality." A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper, we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points' reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique designed for high-dimensional data, the density-based local outlier factor and influenced outlierness methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2014.2365790
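
Illustration (not part of the catalog record): the summary describes antihubs as points that appear very infrequently in the k-NN lists of other points. The following is a minimal sketch of that idea, assuming Euclidean distance and brute-force neighbor search. The function names reverse_neighbor_counts and antihub_scores, and the scoring rule 1 / (count + 1), are illustrative assumptions chosen for this example and are not taken from the record.

```python
import numpy as np


def reverse_neighbor_counts(X, k):
    """Count, for each row of X, how many other points list it among their k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Brute-force pairwise Euclidean distances (fine for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Row i holds the indices of the k nearest neighbors of point i.
    knn = np.argsort(dist, axis=1)[:, :k]
    # Tally how often each index occurs across all k-NN lists.
    return np.bincount(knn.ravel(), minlength=n)


def antihub_scores(X, k):
    """Hypothetical scoring rule: fewer reverse neighbors gives a higher outlier score."""
    counts = reverse_neighbor_counts(X, k)
    return 1.0 / (counts + 1.0)


# Synthetic data: one Gaussian cluster plus a single distant point (index 50).
rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 5))
far_point = np.full((1, 5), 10.0)
X = np.vstack([cluster, far_point])

counts = reverse_neighbor_counts(X, k=5)
scores = antihub_scores(X, k=5)
```

In this toy data the distant point appears in no other point's k-NN list, so its count is 0 and it receives the maximum score of 1.0. Points inside the cluster can also have a count of 0, which is one reason a raw reverse-neighbor count is only a coarse outlier score.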