Apache Flink and clustering-based framework for fast anonymization of IoT stream data

Bibliographic details
Published in:	Intelligent systems with applications 2023-11, Vol.20, p.200267, Article 200267
Main authors:	Sadeghi-Nasab, Alireza; Ghaffarian, Hossein; Rahmani, Mohsen
Format:	Article
Language:	English
Subjects:	Apache Flink; Data anonymity; Data privacy; Data processing engine; Internet of Things; Streaming data

Description
Summary:
• Presenting a novel clustering approach to minimize average data delay.
• Generalizing numerical data in the same way as categorical data, using a constructed DGH (domain generalization hierarchy) tree.
• Introducing a new concept, the cluster similarity threshold, to anonymize the remaining clusters in the final stage and reduce total information loss.
• Applying the Apache Flink framework to handle data streams better and faster.

In this paper, we present a novel framework that takes the expiration period of Internet of Things (IoT) stream data into account when anonymizing it. The IoT is among the fastest-growing technologies in the world, and anonymity is one of the safeguards used to protect data privacy. Because data streams are dynamic, vast, and rapidly changing, traditional approaches cannot be used to anonymize IoT data. The proposed framework operates with a new clustering method and the Apache Flink stream-processing engine. The framework first clusters the received data. If a cluster's size does not meet the k-anonymity threshold, the framework keeps it under review for suppression and deletion; otherwise, the cluster's data are anonymized and published. The framework handles both numerical and categorical data in this way. At the end of the stream, the remaining data are merged and anonymized. An implementation and evaluation using Scala and Apache Flink show that the proposed approach reduces data delay by 12.33–66.62% compared with other methods. Furthermore, merging the leftover clusters at the end of the stream avoids information loss: compared with similar methods, information loss is reduced by 5.68–18.26%. Overall, the evaluation results show better performance in terms of both data delay and information loss.
ISSN:	2667-3053
DOI:	10.1016/j.iswa.2023.200267
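
The publish-or-hold decision described in the summary can be illustrated with a minimal Python sketch. It is not the authors' Scala/Flink implementation. The function name `anonymize_stream`, the `max_distance` parameter, and the nearest-mean assignment rule are assumptions made for illustration. The sketch also simplifies heavily: it handles a single numerical attribute, generalizes values to a plain [min, max] range instead of using a DGH tree, ignores the expiration period, and merges all leftover clusters into one group rather than applying the paper's cluster similarity threshold.

```python
def generalize(cluster):
    """Replace every value in a cluster with the cluster's (min, max) range."""
    lo, hi = min(cluster), max(cluster)
    return [(lo, hi)] * len(cluster)


def anonymize_stream(values, k, max_distance):
    """Illustrative k-anonymity over a stream of numbers (hypothetical helper).

    Returns (published, suppressed): generalized records that were released,
    and raw values that could never be grouped into a cluster of size k.
    """
    open_clusters = []  # clusters still below the k-anonymity threshold
    published = []

    for v in values:
        # Assumed rule: join the open cluster whose mean is nearest,
        # provided it lies within max_distance; otherwise open a new cluster.
        nearest = min(
            open_clusters,
            key=lambda c: abs(sum(c) / len(c) - v),
            default=None,
        )
        if nearest is not None and abs(sum(nearest) / len(nearest) - v) <= max_distance:
            nearest.append(v)
        else:
            nearest = [v]
            open_clusters.append(nearest)

        # A cluster that reaches k records is anonymized and published at once.
        if len(nearest) >= k:
            published.extend(generalize(nearest))
            open_clusters = [c for c in open_clusters if c is not nearest]

    # End of stream: merge whatever is left (a simplification of the paper's
    # similarity-based merging) and publish it only if it satisfies k.
    leftovers = [v for c in open_clusters for v in c]
    if len(leftovers) >= k:
        published.extend(generalize(leftovers))
        return published, []
    return published, leftovers


published, suppressed = anonymize_stream([20, 22, 21, 40, 41, 60, 23], k=3, max_distance=5)
```

In this run the first three readings form a cluster that is published immediately as the range (20, 22), while the four readings left open at the end of the stream are merged and published as the much wider range (23, 60). The width of that final range shows why the way leftover clusters are merged matters for information loss.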