Large language models generate functional protein sequences across diverse families

Bibliographic details
Published in:	Nature biotechnology 2023-08, Vol.41 (8), p.1099-1106
Main authors:	Madani, Ali; Krause, Ben; Greene, Eric R.; Subramanian, Subu; Mohr, Benjamin P.; Holton, James M.; Olmos, Jose Luis; Xiong, Caiming; Sun, Zachary Z.; Socher, Richard; Fraser, James S.; Naik, Nikhil
Format:	Article
Language:	English
Subjects:	631/114/1305; 631/45/607; 631/61/475; Agriculture; Amino acids; Artificial intelligence; BASIC BIOLOGICAL SCIENCES; Bioinformatics; Biomedical and Life Sciences; Biomedical Engineering/Biotechnology; Biomedicine; Biophysics; Biotechnology; Chorismate mutase; Controllability; Datasets; Deep learning; Enzymatic activity; enzymes; Language; Large language models; Life Sciences; Lysozyme; machine learning; Malate dehydrogenase; Natural language; Neural networks; Protein engineering; Protein families; Proteins; proteomics; Sentences; Sequences; Tags
Online access:	Full text

Description
Summary:	Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase. A generative deep-learning model designs artificial proteins with desired enzymatic activities.
ISSN:	1087-0156, 1546-1696
DOI:	10.1038/s41587-022-01618-2