Learning the protein language: Evolution, structure, and function

Bibliographic details
Published in: Cell systems 2021-06, Vol.12 (6), p.654-669.e3
Main authors: Bepler, Tristan; Berger, Bonnie
Format: Article
Language: English
Description
Summary: Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.

• Deep protein language models can learn information from protein sequence
• They capture the structure, function, and evolutionary fitness of sequence variants
• They can be enriched with prior knowledge and inform function predictions
• They can revolutionize protein biology by suggesting new ways to approach design

In this synthesis, Bepler and Berger discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. They consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations.
ISSN: 2405-4712, 2405-4720
DOI: 10.1016/j.cels.2021.05.017