Prediction of enzymatic function with high efficiency and a reduced number of features using genetic algorithm

Bibliographic details
Published in: Computers in biology and medicine 2023-05, Vol.158, p.106799-106799, Article 106799
Main authors: Reis, Diogo R., Santos, Bruno C., Bleicher, Lucas, Zárate, Luis E., Nobre, Cristiane N.
Format: Article
Language: English
Description
Summary: The post-genomic era has raised a growing demand for efficient procedures to identify protein functions, which can be accomplished by applying machine learning to a set of characteristics extracted from the protein. This feature-based approach has been the focus of several works in bioinformatics. In this work, we investigated which protein characteristics, representing the primary, secondary, tertiary, and quaternary structures of the protein, improve the model's quality, by applying dimensionality reduction techniques and using a Support Vector Machine classifier to predict the enzymes' classes. Two approaches were evaluated: feature extraction/transformation, performed using the statistical technique Factor Analysis, and feature selection. For feature selection, we proposed an approach based on a genetic algorithm to address the conflict between the simplicity and the reliability of an ideal representation of the enzymes' characteristics, and we also employed and compared other methods for this purpose. The best result was achieved using a feature subset generated by our implementation of a multi-objective genetic algorithm, enriched with features that this work identified as relevant to represent the enzymes. This subset reduced the dataset by about 87% and reached an F-measure of 85.78%, improving the overall quality of the model classification. In addition, we identified a subset of only 28 features out of a total of 424 that reached an F-measure above 80% for four of the six evaluated classes, showing that satisfactory classification performance can be achieved with a reduced number of enzyme characteristics. The datasets and implementations are openly available.
• We identified 424 attributes of 17,275 unique sequences of the enzymes considered.
• A multi-objective genetic algorithm was proposed to select the best attributes.
• The method combines simplicity and reliability of an ideal representation of the enzymes.
• Subset reduced the dataset by about 87% and reached 85.78% of the F-measure.
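
The trade-off described in the summary, fewer features versus a more reliable classifier, can be illustrated with a small multi-objective genetic algorithm over feature masks. This is a minimal sketch and not the authors' implementation: the names `dominates`, `pareto_front`, `toy_objectives` and `evolve` are hypothetical, the survivor selection (non-dominated masks first, then random fill) is a deliberate simplification, and `toy_objectives` is a placeholder for what would, in the setting of the article, be 1 minus the F-measure of a Support Vector Machine trained on the selected features.

```python
import random


def dominates(a, b):
    """True if objective pair a is at least as good as b in both objectives
    and differs from it. Pairs are (n_features, error); both are minimized."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_front(population, objectives):
    """Return the masks whose objective pairs no other mask dominates."""
    scored = [(mask, objectives(mask)) for mask in population]
    return [m for m, s in scored if not any(dominates(t, s) for _, t in scored)]


def toy_objectives(mask, relevant=frozenset({0, 3, 7})):
    """Placeholder objectives (assumption): the error falls as more of the
    'relevant' feature indices are selected. A real run would train and
    score a classifier on the selected columns instead."""
    selected = {i for i, bit in enumerate(mask) if bit}
    error = 1.0 - len(selected & relevant) / len(relevant)
    return (len(selected), error)


def evolve(n_features=12, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve binary feature masks and return the final non-dominated set."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(pop, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each bit flips with probability p_mut.
            child = tuple(b ^ (rng.random() < p_mut) for b in child)
            children.append(child)
        merged = list(dict.fromkeys(pop + children))  # de-duplicate, keep order
        front = pareto_front(merged, toy_objectives)
        rest = [m for m in merged if m not in front]
        rng.shuffle(rest)
        # Keep non-dominated masks first, then fill with random survivors.
        pop = (front + rest)[:pop_size]
    return pareto_front(pop, toy_objectives)


if __name__ == "__main__":
    for mask in evolve():
        print(mask, toy_objectives(mask))
```

Each mask in the returned set is one compromise between subset size and error; choosing a single subset from that set (for example, the smallest one above a target F-measure) remains a separate decision.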
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2023.106799