Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Nature communications 2022-10, Vol.13 (1), p.6298-11, Article 6298
Hauptverfasser:	Lupo, Umberto, Sgarbossa, Damiano, Bitbol, Anne-Florence
Format:	Artikel
Sprache:	eng
Schlagworte:	631/114/1305 631/114/663/2009 631/114/739 Algorithms Contingency Correlation Humanities and Social Sciences Language multidisciplinary Phylogenetics Phylogeny Predictions Protein structure Proteins Proteins - chemistry Proteins - genetics Science Science (multidisciplinary) Sequence Alignment Structure-function relationships Transformers
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer’s row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer’s column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models. Protein language models taking multiple sequence alignments as inputs capture protein structure and mutational effects. Here, the authors show that these models also encode phylogenetic relationships, and can disentangle correlations due to structural constraints from those due to phylogeny.
ISSN:	2041-1723 2041-1723
DOI:	10.1038/s41467-022-34032-y