Chemometrics for QSAR with low sequence homology: Mycobacterial promoter sequences recognition with 2D-RNA entropies


Bibliographic details
Published in:	Chemometrics and intelligent laboratory systems 2007-01, Vol.85 (1), p.20-26
Main authors:	González-Díaz, Humberto; Pérez-Bello, Alcides; Cruz-Monteagudo, Maykel; González-Díaz, Yenny; Santana, Lourdes; Uriarte, Eugenio
Format:	Article
Language:	English
Subjects:	Entropy; Information theory; Machine learning algorithms; Markov models; Mycobacterial promoter sequences; Mycobacterium; QSAR; RNA secondary structure
Online access:	Full text

Description
Summary:	Predicting mycobacterial promoter sequences for protein synthesis is important in the study of protein metabolism regulation. This goal is, however, considered a challenging computational biology task because of low inter-sequence homology. Consequently, a previous work based only on DNA sequence had to use a large input parameter set and a multilayered feed-forward ANN architecture trained with the error back-propagation algorithm to reach an overall accuracy of 97% [Kalate et al., 2003. Comput. Biol. Chem. 27, 555–564]. Accordingly, one could expect that a notably simpler model may be derived using parameters based on non-linear structural information. In the present work, a method based on molecular folding negentropies ($\Theta_k$) is introduced to predict, for the first time, mycobacterial promoter sequences (mps) from the corresponding RNA secondary structure. The best QSAR equation found was the classification function $\mathrm{mps} = 4.921 \times {}^{0}\Theta_{M} - 1.205$, which recognised 126/135 mps (93.3%) and 100% of 245 control sequences (cs). The model showed a very high Matthews correlation coefficient, C = 0.949. Both average overall accuracy and predictability were 97.6%. Additionally, several machine learning algorithms were applied to reaffirm the validity of the LDA model from the chemometrics point of view. This linear model, with only one parameter (${}^{0}\Theta_{M}$), may be considered by far the simplest reported to date, with no loss of accuracy (97%) relative to Kalate et al.'s model.
ISSN:	0169-7439, 1873-3239
DOI:	10.1016/j.chemolab.2006.03.005
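
The figures quoted in the summary can be checked with a short calculation. The sketch below is illustrative only: it evaluates the one-parameter classification function with the coefficients given in the abstract and recomputes sensitivity, specificity, overall accuracy and the Matthews correlation coefficient from the reported counts (126 of 135 promoters and all 245 controls recognised). The decision rule (score above 0 means "promoter") and the function names `mps_score`, `is_promoter` and `confusion_metrics` are assumptions made for this example; the record does not state the cutoff or describe how the negentropy descriptor itself is computed.

```python
from math import sqrt

# Coefficients quoted in the abstract: mps = 4.921 * theta0_M - 1.205
SLOPE = 4.921
INTERCEPT = -1.205


def mps_score(theta0_m: float) -> float:
    """Evaluate the one-parameter classification function from the abstract."""
    return SLOPE * theta0_m + INTERCEPT


def is_promoter(theta0_m: float, cutoff: float = 0.0) -> bool:
    """ASSUMPTION: a score above `cutoff` (default 0) is read as 'promoter'.

    The abstract gives the equation but not the decision threshold.
    """
    return mps_score(theta0_m) > cutoff


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Counts taken from the abstract: 126/135 promoters, 245/245 controls.
metrics = confusion_metrics(tp=126, fn=9, tn=245, fp=0)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Under these counts the recomputed values agree with the abstract after rounding (93.3% of promoters recognised, 97.6% overall accuracy, C = 0.949), which supports reading C as the Matthews correlation coefficient.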