Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Bibliographic details
Published in:	Data mining and knowledge discovery 2014-09, Vol.28 (5-6), p.1158-1188
Main authors:	Hajian, Sara; Domingo-Ferrer, Josep; Farràs, Oriol
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Automation; Chemistry and Earth Sciences; Classifiers; Computer Science; Customers; Data analysis; Data mining; Data Mining and Knowledge Discovery; Datasets; Decision making; Discrimination; Females; Information society; Information Storage and Retrieval; Learning; Legislation; Personnel selection; Physics; Policies; Preservation; Prevention; Privacy; Publishing; Statistics for Engineering; Working hours

Description
Summary:	Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also potential discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.
ISSN:	1384-5810 1573-756X
DOI:	10.1007/s10618-014-0346-1