Perturbation theory for cross data matrix-based PCA

Bibliographic details
Published in: Journal of Multivariate Analysis, 2022-07, Vol. 190, p. 104960, Article 104960
Main authors: Wang, Shao-Hsuan; Huang, Su-Yun
Format: Article
Language: English
Online access: Full text
Description
Abstract: Principal component analysis (PCA) has long been a useful and important tool for dimension reduction. However, it must be used with care in certain settings, such as high dimension with small sample size. In general, low dimension with large sample size, or a large signal-to-noise ratio, is vital to guarantee the consistency of the leading eigenvalues and eigenvectors obtained by PCA. Cross data matrix (CDM)-based PCA is another way to estimate PCA components: the data are split into two subsets, and a singular value decomposition is computed for the cross product of the corresponding covariance matrices. It has been shown that CDM-based PCA has a broader region of consistency than ordinary PCA for leading eigenvalues and eigenvectors. Although the difference in regions of consistency is well studied, an interesting practical as well as theoretical question is how the two methods differ in eigenvalue and eigenvector estimation, especially when both fall in a common region of consistency. In this article, we derive finite-sample approximation results as well as the asymptotic behavior of CDM-based PCA via matrix perturbation. Furthermore, we derive a comparison measure for CDM-based PCA vs. ordinary PCA. This measure depends only on the data dimension, the noise correlations, and the noise-to-signal ratio (NSR). Using this measure, we develop an algorithm that selects good partitions and integrates the results from these partitions to form a final estimate for CDM-based PCA. Numerical and real data examples are presented for illustration.
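The split-and-decompose idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' estimator or their partition-selection algorithm. The function name `cdm_pca_sketch`, the random half-and-half split, and the square-root rescaling of the singular values (so that the estimates sit on the scale of covariance eigenvalues) are all assumptions made here for illustration.

```python
import numpy as np


def cdm_pca_sketch(X, k, rng=None):
    """Illustrative split-based PCA (hypothetical helper, not from the article).

    X   : array of shape (n, d), one observation per row.
    k   : number of leading components to return.
    rng : seed or numpy Generator controlling the random split.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Randomly partition the observations into two subsets (assumed: equal halves).
    perm = rng.permutation(n)
    half = n // 2
    X1, X2 = X[perm[:half]], X[perm[half:]]

    # Sample covariance matrix of each subset, each of shape (d, d).
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)

    # SVD of the cross product of the two covariance matrices.
    U, s, _ = np.linalg.svd(S1 @ S2)

    # Assumption: if S1 and S2 both approximate the same covariance matrix,
    # their product approximates its square, so square roots of the singular
    # values are on the scale of the covariance eigenvalues.
    values = np.sqrt(s[:k])
    vectors = U[:, :k]  # columns are the estimated leading directions
    return values, vectors
```

Calling `cdm_pca_sketch(X, k=2, rng=0)` on an `(n, d)` data array returns two estimated leading eigenvalues and a `(d, 2)` matrix of directions. Different seeds give different partitions, which is the degree of freedom that the article's partition-selection algorithm addresses.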
ISSN: 0047-259X, 1095-7243
DOI: 10.1016/j.jmva.2022.104960