Gene association study with SVM, MLP and cross-validation for the diagnosis of diseases
Published in: | Progress in natural science 2008, Vol. 18 (6), pp. 741-750 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Gene association study is one of the major challenges of biochip technology, both for gene diagnosis, where only a subset of genes is responsible for a given disease, and for handling the curse of dimensionality, which arises especially in DNA microarray datasets with thousands of genes and only a few experiments (samples). This paper presents a gene selection method that trains linear support vector machine (SVM) and nonlinear multilayer perceptron (MLP) classifiers and tests them with cross-validation to find a gene subset that is optimal or suboptimal for the diagnosis of binary or multiple disease types. Genes are first selected with a linear SVM classifier for each pair of disease types and tested by leave-one-out cross-validation. The gene subset is then initialized as the union of these selections, and genes are deleted from it one by one, each time removing the gene that brings the greatest decrease of the generalization power, over the samples, on the gene subset after removal, where generalization is measured by training MLPs with leave-one-out and leave-four-out cross-validation. The proposed method was tested in experiments on real DNA microarray data, the MIT data and the NCI data. The results show that it outperforms the conventional SNR method in the separability of the data with expression levels on the selected genes. The MIT data comprise 7129 effective genes with only 72 labeled samples belonging to 2 disease classes, and the NCI data comprise 2308 effective genes with only 64 labeled samples belonging to 4 disease classes; only 11 genes (MIT) and 6 genes (NCI) are selected as diagnostic genes. The selected genes are tested by classifying the samples on these genes, using an SVM with leave-one-out cross-validation and an MLP with both leave-one-out and leave-four-out cross-validation. No sample is misclassified, which indicates that the selected genes can indeed be considered diagnostic genes for the corresponding diseases. |
---|---|
ISSN: | 1002-0071 |
DOI: | 10.1016/j.pnsc.2007.11.022 |
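
The backward-elimination step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a nearest-centroid classifier stands in for the MLP, the toy matrix `X` is invented, and the function names and the `min_accuracy` threshold are hypothetical. It also assumes one reading of the removal criterion, namely that each step drops the gene whose removal leaves the highest leave-one-out accuracy.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x, genes):
    """Predict the label of x from class centroids restricted to `genes`.

    Assumption: this simple classifier replaces the MLP from the abstract.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [mean(r[g] for r in rows) for g in genes]
        dist = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:  # ties go to the alphabetically first label
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy using only the columns listed in `genes`."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]  # hold out sample i
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i], genes) == y[i]
    return hits / len(X)


def backward_eliminate(X, y, genes, min_accuracy=1.0):
    """Greedily drop genes while leave-one-out accuracy stays >= min_accuracy."""
    genes = list(genes)
    while len(genes) > 1:
        # Score every subset obtained by removing exactly one gene.
        scored = [
            (loo_accuracy(X, y, [g for g in genes if g != drop]), drop)
            for drop in genes
        ]
        best_acc, drop = max(scored, key=lambda t: t[0])
        if best_acc < min_accuracy:
            break  # every possible removal hurts too much; stop here
        genes.remove(drop)
    return genes


# Invented toy data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [0.0, 5.0, 1.0],
    [0.2, 1.0, 2.0],
    [0.1, 3.0, 1.5],
    [1.0, 5.0, 1.0],
    [1.2, 1.0, 2.0],
    [1.1, 3.0, 1.5],
]
y = ["A", "A", "A", "B", "B", "B"]

selected = backward_eliminate(X, y, genes=[0, 1, 2])
print(selected)
```

In the setting of the abstract, the starting gene list would be the union of the genes chosen by the pairwise linear SVMs, and the accuracy estimate would come from MLPs evaluated with leave-one-out and leave-four-out cross-validation.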