Gene Classification Using Codon Usage and Support Vector Machines

Bibliographic Details
Published in:	IEEE/ACM transactions on computational biology and bioinformatics 2009-01, Vol.6 (1), p.134-143
Main authors:	Ma, Jianmin; Nguyen, M.N.; Rajapakse, J.C.
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Cluster analysis; Codon - classification; Codon - genetics; codon usage bias; Databases, Genetic; Discriminant Analysis; DNA; Frequency; gene classification; Gene expression; Genes; Genes, MHC Class I; Genes, MHC Class II; Genetic Code; Genetic mutations; HLA Antigens - classification; HLA Antigens - genetics; Human Leukocyte Antigen (HLA); Humans; Major Histocompatibility Complex (MHC); Major Histocompatibility Complex - genetics; Normal Distribution; Pattern Recognition, Automated - methods; Proteins; Relative Synonymous Codon Use (RSCU) frequency; Reproducibility of Results; Sequence Analysis, DNA - methods; Sequences; Studies; Support vector machine classification; Support vector machines; White blood cells

Description
Abstract:	A novel approach to gene classification is proposed that uses codon usage bias as the input feature vector for classification by support vector machines (SVMs). The DNA sequence is first converted to a 59-dimensional feature vector in which each element corresponds to the relative synonymous usage frequency of a codon. Because the input to the classifier is independent of sequence length and variance, the approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated on 1,841 Human Leukocyte Antigen (HLA) sequences, which are classified into two major classes, HLA-I and HLA-II; each major class is further subdivided into subgroups of HLA-I and HLA-II molecules. Using codon usage frequencies, a binary SVM achieved an accuracy of 99.3% for HLA major class classification, and multi-class SVMs achieved accuracies of 99.73% and 98.38% for subclass classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
ISSN:	1545-5963 1557-9964
DOI:	10.1109/TCBB.2007.70240
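
Illustrative sketch (not part of the catalog record): the feature extraction described in the abstract could look roughly like the Python below. It assumes the conventional RSCU definition (a codon's observed count divided by the mean count across its synonymous family) and assumes the 59 dimensions are the 61 sense codons of the standard genetic code minus ATG (Met) and TGG (Trp), which have no synonyms. The paper's exact formula and codon ordering may differ. The names `rscu_vector` and `FEATURE_CODONS` are hypothetical, and assigning 0.0 to codons of an amino acid that never occurs in the sequence is a choice made here, not taken from the source.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group sense codons by amino acid; drop single-codon families (Met, Trp).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":  # skip stop codons
        FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: cs for aa, cs in FAMILIES.items() if len(cs) > 1}

# Fixed (alphabetical) ordering of the 59 feature codons; an assumption.
FEATURE_CODONS = sorted(c for cs in FAMILIES.values() for c in cs)


def rscu_vector(seq):
    """Return a 59-element RSCU vector for an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for c in family:
            # Observed count over the family's mean count; 0.0 when the
            # amino acid is absent from the sequence (assumption).
            values[c] = counts[c] * len(family) / total if total else 0.0
    return [values[c] for c in FEATURE_CODONS]
```

Every sequence maps to a vector of the same length regardless of how long it is, which is the property the abstract relies on. Such vectors could then be passed to any SVM implementation (for example `sklearn.svm.SVC`); the record does not state which implementation the authors used.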