Privacy-preserving multi-party PCA computation on horizontally and vertically partitioned data based on outsourced QR decomposition
Published in: | The Journal of supercomputing 2023-09, Vol.79 (13), p.14358-14387 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Data mining has found many applications in diverse areas such as banking, marketing, healthcare and fraud detection. One of its valuable tools is principal component analysis (PCA). Computing PCA over data belonging to several data owners while preserving their privacy is a need in many industries, such as healthcare. Here, we propose a privacy-preserving multi-party protocol to compute PCA over horizontally and vertically distributed data using QR matrix decomposition and homomorphic encryption. Our protocol is the first privacy-preserving PCA computation scheme that is applicable to both horizontally and vertically partitioned data and finds all of the principal components. Our protocol is secure against collusion of the data owners in the semi-honest security model. In the performance analysis, we show that in the horizontal setting, increasing the number of data owners decreases the computation overhead of each data owner but increases the communication and computation overhead of the server. We also show that running our proposed scheme on the Australian data set of size 690 × 14, distributed horizontally among 50 data owners, takes 4.38 s. On the Ionosphere data set of size 351 × 34, distributed horizontally among 10 data owners, it takes 31.8 s. In the vertical distribution, running our scheme on the Gait data set of size 48 × 321 distributed among 7 data owners and on the Gastrointestinal Lesions data set of size 76 × 698 distributed among 10 data owners takes 4.4 h and 15.7 h, respectively. |
---|---|
ISSN: | 0920-8542 1573-0484 |
DOI: | 10.1007/s11227-023-05206-2 |
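
The summary mentions computing PCA over horizontally partitioned data by way of QR decomposition. The following is a minimal plaintext sketch of that general idea and is not the paper's protocol: it uses no homomorphic encryption and offers no privacy. It assumes each data owner holds a block of rows, and that the per-owner R factors can be stacked and factored again (a standard linear-algebra identity). The function names, the synthetic data, and the step of sharing column sums in the clear are all assumptions made for illustration.

```python
import numpy as np


def local_summary(block):
    """Hypothetical owner-side step: report row count and column sums."""
    return block.shape[0], block.sum(axis=0)


def local_r_factor(block, global_mean):
    """Hypothetical owner-side step: center the rows, keep only the R factor."""
    centered = block - global_mean
    return np.linalg.qr(centered, mode="r")


def pca_from_r_factors(r_factors, n_rows):
    """Hypothetical server-side step: combine R factors into all components."""
    stacked = np.vstack(r_factors)
    r = np.linalg.qr(stacked, mode="r")
    # For the centered full matrix X, R^T R equals X^T X, so the right
    # singular vectors of R are the principal components of X.
    _, singular_values, vt = np.linalg.svd(r)
    explained_variance = singular_values**2 / (n_rows - 1)
    return vt, explained_variance


# Synthetic stand-in with the same shape as the 690 x 14 example; not real data.
rng = np.random.default_rng(0)
data = rng.normal(size=(690, 14))
blocks = np.array_split(data, 50)  # 50 hypothetical data owners

counts, sums = zip(*(local_summary(b) for b in blocks))
n_rows = sum(counts)
global_mean = np.sum(sums, axis=0) / n_rows

r_factors = [local_r_factor(b, global_mean) for b in blocks]
components, variances = pca_from_r_factors(r_factors, n_rows)
```

Each row of `components` is one principal direction, and `variances` holds the matching variances in descending order. In this sketch the per-owner work shrinks as the rows are split among more owners while the server stacks more R factors, which is in line with the trade-off the summary reports for the horizontal setting.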