Large language models generate functional protein sequences across diverse families

Bibliographic details
Published in: Nature Biotechnology 2023-08, Vol. 41 (8), p. 1099-1106
Main authors: Madani, Ali; Krause, Ben; Greene, Eric R.; Subramanian, Subu; Mohr, Benjamin P.; Holton, James M.; Olmos, Jose Luis; Xiong, Caiming; Sun, Zachary Z.; Socher, Richard; Fraser, James S.; Naik, Nikhil
Format: Article
Language: English
Description
Summary: Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase. A generative deep-learning model designs artificial proteins with desired enzymatic activities.
ISSN: 1087-0156, 1546-1696
DOI: 10.1038/s41587-022-01618-2