Consensus–relevance kNN and covariate shift mitigation
Published in: Machine learning 2024, Vol. 113 (1), p. 325-353
Author:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Classification and regression algorithms based on k-nearest neighbors (kNN) are often ranked among the top-10 machine learning algorithms, due to their performance, flexibility, interpretability, non-parametric nature, and computational efficiency. Nevertheless, in existing kNN algorithms, the kNN radius, which plays a major role in the quality of kNN estimates, is independent of any weights associated with the training samples in a kNN neighborhood. This omission, besides limiting the performance and flexibility of kNN, causes difficulties in correcting for covariate shift (e.g., selection bias) in the training data, in taking advantage of unlabeled data, and in domain adaptation and transfer learning. We propose a new weighted kNN algorithm that, given training samples, each associated with two weights, called *consensus* and *relevance* (which may also depend on the query at hand), and a request for an estimate of the posterior at a query, works as follows. First, it determines the kNN neighborhood as the training samples within the k-th relevance-weighted order statistic of the distances of the training samples from the query. Second, it uses the training samples in this neighborhood to produce the desired estimate of the posterior (output label or value) via consensus-weighted aggregation, as in existing kNN rules. Furthermore, we show that kNN algorithms are affected by covariate shift, and that the commonly used sample-reweighting technique does not correct covariate shift in existing kNN algorithms. We then show how to mitigate covariate shift in kNN decision rules by instead using our proposed consensus-relevance kNN algorithm with relevance weights determined by the amount of covariate shift (e.g., the ratio of sample probability densities before and after the shift). Finally, we provide experimental results, using 197 real datasets, demonstrating that the proposed approach is slightly better (in terms of F1 score) on average than competing benchmark approaches for mitigating selection bias, and that there are quite a few datasets for which it is significantly better.
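The abstract describes the decision rule precisely enough to sketch in code. Below is a minimal, hypothetical Python sketch of the two-step consensus-relevance kNN rule, assuming that the "k-th relevance-weighted order statistic" means the smallest distance at which the cumulative relevance weight of the distance-sorted training samples reaches k; the function name `cr_knn_predict` and the toy density-ratio weights are illustrative, not taken from the paper.

```python
import numpy as np

def cr_knn_predict(X_train, y_train, x_query, k, relevance, consensus):
    """Consensus-relevance kNN estimate at a single query (regression form).

    relevance, consensus: per-sample weight arrays; either may also be
    recomputed per query, as the abstract allows.
    """
    # Distances from the query to every training sample.
    d = np.linalg.norm(X_train - x_query, axis=1)
    order = np.argsort(d)

    # Step 1: kNN radius = k-th relevance-weighted order statistic of the
    # distances: walk outward until the accumulated relevance weight of
    # the sorted samples reaches k (assumed reading of the abstract).
    cum_rel = np.cumsum(relevance[order])
    idx = min(np.searchsorted(cum_rel, k), len(d) - 1)
    radius = d[order[idx]]
    in_hood = d <= radius

    # Step 2: consensus-weighted aggregation over the neighborhood; a
    # weighted mean here, a weighted majority vote for classification.
    w = consensus[in_hood]
    return np.sum(w * y_train[in_hood]) / np.sum(w)

# Covariate-shift use (per the abstract): relevance weights set to the
# density ratio of the shifted vs. original covariate distributions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # training covariates ~ N(0, I)
y = X[:, 0] + 0.1 * rng.normal(size=200)
rel = np.exp(X[:, 0] - 0.5)                   # exact ratio N(1,1)/N(0,1) on dim 0
est = cr_knn_predict(X, y, np.array([1.0, 0.0]), k=5,
                     relevance=rel, consensus=np.ones(200))
```

With unit relevance and consensus weights the radius reduces to the ordinary k-th nearest-neighbor distance and the aggregation to a plain average, so the sketch collapses to standard kNN, which matches the abstract's claim that the proposal generalizes existing kNN rules.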
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-023-06378-x