Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone


Bibliographic details
Published in: BMC Bioinformatics 2017-07, Vol.18 (1), p.349-349, Article 349
Main authors: Ruiz-Blanco, Yasser B, Agüero-Chapin, Guillermin, García-Hernández, Enrique, Álvarez, Orlando, Antunes, Agostinho, Green, James
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Summary: Advances in both next-generation sequencing (NGS) technologies and mass spectrometry-based proteomics have allowed the continuous growth of available proteomes and metaproteomes in biological databases. However, the high structural variety of proteins in known proteomes makes protein functional characterization a challenging task in modern computational biology and bioinformatics [1]. Because manually curated annotations are available for only a small portion of investigated systems, the wealth of genomic and transcriptomic information generated by NGS technologies [2] requires accurate computational annotation tools [3]. The same is true for the functional annotation of 3D structures in databases such as the PDB [4], SCOP [5] and CATH [6], as biologically uncharacterized proteins are continuously being incorporated into these databases; currently about 3725 structures in the PDB are classified as ‘unknown function’. Assigning a functional class to a query protein is a complex problem, not only because of structural complexity but also because a single protein can have multiple functions, owing either to its multiple domains or to its subcellular locations and substrate concentrations [7]. Nevertheless, protein functional inference has traditionally relied on structural/sequence similarities provided by alignment-based algorithms. The most common alignment-based (AB) approaches used in genomic and amino acid sequence databases to identify protein functional signals include the Smith-Waterman algorithm [8], the Basic Local Alignment Search Tool (BLAST) suite of programs [9], and profile Hidden Markov Models (HMMs) [10]. Profile HMMs are at the core of the popular Protein family (Pfam) database [11]. In particular, for the effective identification of enzymatic functions within proteomes, BLAST and HMMs have been implemented in the annotation pipeline of EnzymeDetector, along with the integration of the main biological databases [12].
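As an illustration of the alignment-based idea mentioned above, the following is a minimal sketch of Smith-Waterman local alignment scoring. It is not the implementation used by the authors or by any of the cited tools. The function name `smith_waterman_score` and the scoring parameters (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for readability; practical tools typically use substitution matrices and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Hypothetical helper for illustration only: the scoring scheme is a
    simple match/mismatch/linear-gap model, not a substitution matrix.
    """
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair
            # Scores for opening or extending a gap in either sequence.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A short peptide embedded in a longer sequence is found locally.
    print(smith_waterman_score("GGGHEAGTTT", "HEAG"))
```

Because every cell is floored at zero, the score reflects only the best-matching local region, which is why this family of methods can detect a conserved segment inside otherwise unrelated sequences.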
Despite the great success of these methods, sequence-similarity-based approaches often fail when attempting to align proteins that share less than 30-40% identity. Alignments within this so-called twilight zone are often unreliable, resulting in reduced prediction accuracy [13, 14]. This limitation has caused a sustained increase in the number of unannotated proteins during the examination of genomes and proteomes from a variety of organisms and environmental samples. Consequently, alignment-free (AF) approaches are
ISSN:1471-2105
DOI:10.1186/s12859-017-1758-x