Leveraging genomic redundancy to improve inference and alignment of orthologous proteins


Bibliographic details
Published in: G3 : genes - genomes - genetics 2023-12, Vol.13 (12)
Main authors: Singleton, Marc; Eisen, Michael
Format: Article
Language: eng
Description
Summary: Identifying protein sequences with common ancestry is a core task in bioinformatics and evolutionary biology. However, methods for inferring and aligning such sequences in annotated genomes have not kept pace with the increasing scale and complexity of the available data. Thus, in this work, we implemented several improvements to the traditional methodology that more fully leverage the redundancy of closely related genomes and the organization of their annotations. Two highlights include the application of the more flexible k-clique percolation algorithm for identifying clusters of orthologous proteins and the development of a novel technique for removing poorly supported regions of alignments with a phylogenetic hidden Markov model (phylo-HMM). In making the latter, we wrote a fully documented Python package, Homomorph, that implements standard HMM algorithms, and created a set of tutorials to promote its use by a wide audience. We applied the resulting pipeline to a set of 33 annotated Drosophila genomes, generating 22,813 orthologous groups and 8,566 high-quality alignments.

Though the identification of orthologous proteins is a key step in many comparative genomics analyses, methods for inferring and aligning their sequences in annotated genomes have not kept pace with the increasing scale of the available data. Thus, Singleton and Eisen implemented several improvements to the traditional methodology that more fully leverage the redundancy of closely related genomes and the organization of their annotations. The authors applied the resulting pipeline to 33 annotated Drosophila genomes, generating 22,813 orthologous groups and 8,566 high-quality alignments.
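
As an illustration of the k-clique percolation idea named in the summary, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the function name, the toy graph, and the protein identifiers are hypothetical, and the brute-force clique search is only suitable for very small graphs. It assumes the usual definition, in which a community is the union of all k-cliques that can be reached from one another through cliques sharing k-1 nodes.

```python
from itertools import combinations


def k_clique_communities(edges, k=3):
    """Return k-clique percolation communities as a sorted list of frozensets."""
    # Build an undirected adjacency map, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate every k-clique by brute force (toy graphs only).
    nodes = sorted(adj)
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: two k-cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes in one connected set of cliques.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))


# Hypothetical graph: nodes are proteins, edges are reciprocal sequence hits.
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # triangle
    ("a2", "a4"), ("a3", "a4"),                # second triangle sharing edge a2-a3
    ("a4", "b1"),                              # lone bridging edge, in no triangle
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),  # separate triangle
]
communities = k_clique_communities(edges, k=3)
for group in communities:
    print(sorted(group))
```

In this toy graph the two triangles that share the edge a2-a3 merge into one group, while the single edge a4-b1 is too weak a link to pull the b proteins into it. A node that sits in cliques belonging to two different communities would be reported in both, which is one sense in which the method is more flexible than a strict partition.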
ISSN: 2160-1836
DOI: 10.1093/g3journal/jkad222