Classification for high-dimension low-sample size data

Bibliographic details
Published in:	Pattern recognition 2022-10, Vol.130, p.108828, Article 108828
Main authors:	Shen, Liran; Er, Meng Joo; Yin, Qingbo
Format:	Article
Language:	English
Subjects:	Binary linear classifier; Covariance matrix; Data piling; Quadratic programming
Online access:	Full text

Description
Summary:
• The cause of data piling on High Dimensional Low Sample Size (HDLSS) data sets is derived.
• A novel classification criterion for HDLSS data, tolerance similarity, is proposed.
• Building on this criterion, a novel linear binary classifier (NPDMD) is designed.
• NPDMD is suitable for different real-world applications.

High-dimension and low-sample-size (HDLSS) data sets have posed great challenges to many machine learning methods. Practical HDLSS problems call for new classification techniques. After the cause of the over-fitting phenomenon is identified, a new classification criterion for HDLSS data sets, termed tolerance similarity, is proposed to emphasize maximization of within-class variance on the premise of class separability. Building on this criterion, a novel linear binary classifier, termed No-separated Data Maximum Dispersion classifier (NPDMD), is designed. The main idea of the NPDMD is to spread the samples of the two classes over a large interval in the respective positive or negative space along the projecting direction, provided that the distance between the projection means of the two classes is large enough. The salient features of the proposed NPDMD are: (1) the NPDMD operates well on HDLSS data sets; (2) the NPDMD solves the objective function in the entire feature space to avoid the data-piling phenomenon; (3) the NPDMD exploits the low-rank property of the covariance matrix for HDLSS data sets to speed up computation; (4) the NPDMD is suitable for different real-world applications; (5) the NPDMD can be implemented readily using Quadratic Programming. Theoretical properties of the NPDMD are derived, and a series of evaluations is conducted on one simulated and six real-world benchmark data sets, including face classification and mRNA classification. Experimental results and comprehensive studies demonstrate the superiority of the NPDMD in terms of correct classification rate, mean within-group correct classification rate and the area under the ROC curve.
ISSN:	0031-3203 1873-5142
DOI:	10.1016/j.patcog.2022.108828
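
The data-piling phenomenon mentioned in the summary can be illustrated with a small sketch. It is not the NPDMD and does not reproduce the authors' experiments: it assumes synthetic Gaussian data, and the function name `piling_demo`, the sample sizes, and the use of the minimum-norm least-squares direction are choices made here for illustration only. When the dimension exceeds the number of training samples, that direction projects every training sample of a class onto the same value, while unseen samples remain spread out.

```python
import numpy as np


def piling_demo(n_per_class=10, dim=500, seed=0):
    """Project HDLSS data onto a minimum-norm interpolating direction.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    shift = np.zeros(dim)
    shift[0] = 1.0  # the two class means differ along the first axis only

    def sample(n):
        x_pos = rng.standard_normal((n, dim)) + shift
        x_neg = rng.standard_normal((n, dim)) - shift
        x = np.vstack([x_pos, x_neg])
        y = np.concatenate([np.ones(n), -np.ones(n)])
        return x, y

    x_train, y_train = sample(n_per_class)
    x_test, y_test = sample(n_per_class)

    # With dim > number of samples, x_train has full row rank (almost surely),
    # so the minimum-norm solution w satisfies x_train @ w == y_train exactly.
    w = np.linalg.pinv(x_train) @ y_train

    return x_train @ w, y_train, x_test @ w, y_test


train_proj, y_train, test_proj, y_test = piling_demo()

# Within-class spread of the projections: close to zero on the training
# set (data piling), clearly non-zero on held-out samples.
print("train spread (+1 class):", train_proj[y_train > 0].std())
print("test spread  (+1 class):", test_proj[y_test > 0].std())
```

The collapse of the training projections onto two points is the kind of over-fitting that a criterion favouring larger within-class spread along the projecting direction is meant to counter.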