Protein embeddings improve phage-host interaction prediction

Bibliographic details
Published in:	PloS one 2023-07, Vol.18 (7), p.e0289030
Main authors:	Gonzales, Mark Edward M, Ureta, Jennifer C, Shrestha, Anish M S
Format:	Article
Language:	English
Subjects:	Amino Acid Sequence; Amino acids; Analysis; Annotations; Antimicrobial resistance; Bacteriophages; Binding; Binding proteins; Bioinformatics; Biology and Life Sciences; Causes of; Computer and Information Sciences; Data collection; Differential Threshold; Drug resistance in microorganisms; Engineering and Technology; Genomes; Language; Medicine and Health Sciences; Mental Recall; Modelling; Phages; Predictions; Protein binding; Proteins; Proteome; Proteomes; Receptors; Social Sciences
Online access:	Full text

Description
Abstract:	With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0289030
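
Illustrative note: the abstract reports weighted F1 "across different prediction confidence thresholds" for a multiclass host-genus classifier. The following is a minimal sketch of what such an evaluation might look like, not the authors' implementation. The function names, the toy probabilities, and the choice to relabel low-confidence predictions as "others" are all assumptions made for illustration; the embedding step (for example with ProtT5) and the classifier itself are omitted.

```python
from collections import Counter


def predict_with_threshold(probs, classes, threshold, fallback="others"):
    """Return the most probable class, or `fallback` if its probability
    is below `threshold`. The "others" fallback label is an assumption."""
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best] if probs[best] >= threshold else fallback


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's
    number of true instances (its support)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # tp + fn equals the support n >= 1, so the denominator is nonzero.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Toy data (made up): class probabilities a classifier might output
# for four phages, given embeddings of their receptor-binding proteins.
classes = ["Escherichia", "Klebsiella", "Salmonella"]
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.20, 0.70, 0.10],
    [0.10, 0.30, 0.60],
]
y_true = ["Escherichia", "Escherichia", "Klebsiella", "Salmonella"]

for threshold in (0.0, 0.6):
    y_pred = [predict_with_threshold(p, classes, threshold) for p in probs]
    print(threshold, y_pred, round(weighted_f1(y_true, y_pred), 4))
```

Raising the threshold trades coverage for confidence: in this toy data the second phage's top probability (0.40) falls below 0.6, so it is relabeled and the weighted F1 drops.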