Privacy-preserving multi-party PCA computation on horizontally and vertically partitioned data based on outsourced QR decomposition

Bibliographic details
Published in:	The Journal of Supercomputing, 2023-09, Vol. 79 (13), pp. 14358-14387
Main authors:	Jaberi, Mehrad; Mala, Hamid
Format:	Article
Language:	eng
Subjects:	Compilers; Computation; Computer Science; Data mining; Datasets; Decomposition; Fraud; Health care; Interpreters; Principal components analysis; Privacy; Processor Architectures; Programming Languages; Vertical distribution
Online access:	Full text
Description
Abstract:	Data mining has found many applications in diverse areas such as banking, marketing, healthcare and fraud detection. One of the valuable tools in data mining is principal component analysis (PCA). Computing PCA over data belonging to several data owners while respecting their privacy is a need in many industries such as healthcare. Here, we propose a privacy-preserving multi-party protocol to compute PCA over horizontally and vertically distributed data using QR matrix decomposition and homomorphic encryption. Our protocol is the first privacy-preserving PCA computation scheme that is applicable to both horizontally and vertically partitioned data and finds all of the principal components. Our protocol is secure against collusion of the data owners in the semi-honest security model. In the performance analysis, we show that in the horizontal setting, increasing the number of data owners decreases the computation overhead of each data owner but increases the communication and computation overhead of the server. We also show that running our proposed scheme on the Australian data set of size 690 × 14, distributed horizontally among 50 data owners, takes 4.38 s. On the Ionosphere data set of size 351 × 34, distributed horizontally among 10 data owners, it takes 31.8 s. In the vertical distribution, running our scheme on the Gait data set of size 48 × 321 distributed among 7 data owners and on the Gastrointestinal Lesions data set of size 76 × 698 distributed among 10 data owners takes 4.4 h and 15.7 h, respectively.
ISSN:	0920-8542 1573-0484
DOI:	10.1007/s11227-023-05206-2
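
Illustrative note: the abstract describes computing PCA over horizontally partitioned data via QR decomposition. The sketch below shows only the underlying linear algebra in plaintext, under the assumption that each data owner holds a block of rows: each owner reduces its centered block to a triangular factor R, the factors are stacked and factored again, and the principal components are read off the combined R. It is not the authors' protocol; it omits the homomorphic encryption and outsourcing steps entirely, and the function names `local_r` and `pca_from_blocks` are hypothetical.

```python
import numpy as np


def local_r(block, mean):
    """Return the R factor of one owner's centered row block (assumed local step)."""
    return np.linalg.qr(block - mean, mode="r")


def pca_from_blocks(blocks):
    """Plaintext PCA over row-partitioned data using stacked QR factors.

    Returns (components, variances): rows of `components` are the principal
    directions, ordered by decreasing variance.
    """
    n = sum(b.shape[0] for b in blocks)
    # Global column mean from per-owner sums; a real protocol would have to
    # protect these sums, which this sketch does not do.
    mean = sum(b.sum(axis=0) for b in blocks) / n
    # Stacking the local R factors preserves X^T X of the full centered matrix.
    stacked = np.vstack([local_r(b, mean) for b in blocks])
    r = np.linalg.qr(stacked, mode="r")
    # Right singular vectors of R are the principal components of X.
    _, s, vt = np.linalg.svd(r)
    return vt, s**2 / (n - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(60, 5))
    components, variances = pca_from_blocks(np.array_split(data, 3))
    print(components.shape, variances)
```

The design point is that only the small triangular factors, not the raw rows, need to be combined, which is consistent with the abstract's observation that adding owners reduces each owner's workload while increasing the server's.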