Anonymizing bag-valued sparse data by semantic similarity-based clustering


Bibliographic details
Published in: Knowledge and Information Systems, 2013-05, Vol. 35 (2), p. 435-461
Main authors: Liu, Junqiang; Wang, Ke
Format: Article
Language: English
Description
Summary: Web query logs provide a rich wealth of information, but also present serious privacy risks. We preserve privacy in publishing vocabularies extracted from a web query log by introducing vocabulary k-anonymity, which prevents the privacy attack of re-identification that reveals the real identities of vocabularies. A vocabulary is a bag of query-terms extracted from queries issued by a user at a specified granularity. Such bag-valued data are extremely sparse, which makes it hard to retain enough utility in enforcing k-anonymity. To the best of our knowledge, the prior works do not solve such a problem, among which some achieve a different privacy principle, for example, differential privacy, some deal with a different type of data, for example, set-valued data or relational data, and some consider a different publication scenario, for example, publishing frequent keywords. To retain enough data utility, a semantic similarity-based clustering approach is proposed, which measures the semantic similarity between a pair of terms by the minimum path distance over a semantic network of terms such as WordNet, computes the semantic similarity between two vocabularies by a weighted bipartite matching, and publishes the typical vocabulary for each cluster of semantically similar vocabularies. Extensive experiments on the AOL query log show that our approach can retain enough data utility in terms of loss metrics and in frequent pattern mining.
ISSN: 0219-1377, 0219-3116
DOI: 10.1007/s10115-012-0515-8
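
The summary describes two distance computations: a minimum path distance between terms over a semantic network, and a bipartite matching between two vocabularies (bags of terms). The following is a minimal sketch of those two steps only, not the authors' implementation. The toy network `EDGES`, the penalty `UNMATCHED_COST`, and the function names are all assumptions made for illustration, and the matching is solved by brute force over permutations, which is only practical for very small bags. The clustering and typical-vocabulary steps are not shown.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy semantic network (undirected edges), standing in for WordNet.
EDGES = [
    ("entity", "vehicle"), ("entity", "animal"),
    ("vehicle", "car"), ("vehicle", "truck"), ("car", "sedan"),
    ("animal", "dog"), ("animal", "cat"),
]

def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

GRAPH = build_graph(EDGES)

# Assumed penalty for a term that has no partner or no path in the network.
UNMATCHED_COST = 6

def term_distance(a, b, graph=GRAPH):
    """Minimum number of edges between two terms (BFS); None if unreachable."""
    if a == b:
        return 0
    if a not in graph or b not in graph:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == b:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def pair_cost(x, y):
    """Cost of matching two terms; None marks a padding slot."""
    if x is None or y is None:
        return UNMATCHED_COST
    dist = term_distance(x, y)
    return UNMATCHED_COST if dist is None else dist

def vocabulary_distance(bag_a, bag_b):
    """Minimum total cost of a one-to-one matching between two bags.

    Bags are lists, so repeated terms are kept. The shorter bag is padded
    with None. Brute force: tries every permutation of the padded bag.
    """
    short, long_ = sorted((list(bag_a), list(bag_b)), key=len)
    short = short + [None] * (len(long_) - len(short))
    return min(
        sum(pair_cost(x, y) for x, y in zip(perm, long_))
        for perm in permutations(short)
    )
```

For example, `vocabulary_distance(["car", "dog"], ["truck", "cat"])` pairs `car` with `truck` and `dog` with `cat`, since that pairing has the lowest total path distance in the toy network. For realistic bag sizes, a polynomial-time assignment solver would replace the permutation search.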