Identifying promising sequences for protein engineering using a deep transformer protein language model

Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near limitless size of the protein sequence landscape, it is no surprise that these desirable sequences are often a relative rarity. This makes identifying such sequences a costly and time-consuming endeavor.

Detailed description

Bibliographic details
Published in: Proteins, structure, function, and bioinformatics, 2023-11, Vol.91 (11), p.1471-1486
Main authors: Frisby, Trevor S; Langmead, Christopher James
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 1486
container_issue 11
container_start_page 1471
container_title Proteins, structure, function, and bioinformatics
container_volume 91
creator Frisby, Trevor S
Langmead, Christopher James
description Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near limitless size of the protein sequence landscape, it is no surprise that these desirable sequences are often a relative rarity. This makes identifying such sequences a costly and time-consuming endeavor. In this work, we show how to use a deep transformer protein language model to identify sequences that have the most promise. Specifically, we use the model's self-attention map to calculate a Promise Score that weights the relative importance of a given sequence according to predicted interactions with a specified binding partner. This Promise Score can then be used to identify strong binders worthy of further study and experimentation. We use the Promise Score within two protein engineering contexts: Nanobody (Nb) discovery and protein optimization. With Nb discovery, we show how the Promise Score provides an effective way to select lead sequences from Nb repertoires. With protein optimization, we show how to use the Promise Score to select site-specific mutagenesis experiments that identify a high percentage of improved sequences. In both cases, we also show how the self-attention map used to calculate the Promise Score can indicate which regions of a protein are involved in intermolecular interactions that drive the targeted property. Finally, we describe how to fine-tune the transformer protein language model to learn a predictive model for the targeted property, and discuss the capabilities and limitations of fine-tuning with and without knowledge transfer within the context of protein engineering.
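The description above says the Promise Score is computed from a transformer's self-attention map, weighting a sequence by its predicted interactions with a specified binding partner. As a loose illustration of that idea only (the function name `promise_score`, the block-averaging rule, and the uniform-attention example are all assumptions, not the authors' actual formula), one could average the attention mass exchanged between the binder and partner positions of a concatenated sequence:

```python
import numpy as np

def promise_score(attention, binder_len):
    """Toy cross-attention score for a binder/partner pair.

    Illustrative sketch only: the paper derives its Promise Score from a
    protein language model's self-attention map, but this exact formula
    (mean attention exchanged between the two chains) is an assumption.

    attention : (L, L) self-attention map over the concatenated
                binder + partner sequence (rows sum to 1).
    binder_len: number of binder positions at the start of the sequence.
    """
    # Off-diagonal blocks hold the inter-chain (binder <-> partner) attention.
    binder_to_partner = attention[:binder_len, binder_len:]
    partner_to_binder = attention[binder_len:, :binder_len]
    return (binder_to_partner.mean() + partner_to_binder.mean()) / 2.0

# Uniform attention over a length-6 concatenation (binder_len = 2):
# every cross-attention entry is 1/6, so the score is 1/6 as well.
attn = np.full((6, 6), 1.0 / 6.0)
score = promise_score(attn, 2)  # -> 0.1666...
```

In practice one would rank a repertoire of candidate binders by such a score and, as the abstract notes, inspect the same attention map to see which residues appear to drive the interaction.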
doi_str_mv 10.1002/prot.26536
format Article
eissn 1097-0134
pmid 37337902
publisher United States: Wiley Subscription Services, Inc
orcid https://orcid.org/0000-0002-2865-6955
rights 2023 The Authors. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC. This article is published under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
tpages 16
fulltext fulltext
identifier ISSN: 0887-3585
ispartof Proteins, structure, function, and bioinformatics, 2023-11, Vol.91 (11), p.1471-1486
issn 0887-3585
1097-0134
language eng
recordid cdi_proquest_miscellaneous_2827921197
source Wiley-Blackwell Journals
subjects Amino acid sequence
Binders
Knowledge management
Language
Mutagenesis
Nanobodies
Optimization
Prediction models
Protein engineering
Proteins
Transformers
title Identifying promising sequences for protein engineering using a deep transformer protein language model
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T06%3A46%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Identifying%20promising%20sequences%20for%20protein%20engineering%20using%20a%20deep%20transformer%20protein%20language%20model&rft.jtitle=Proteins,%20structure,%20function,%20and%20bioinformatics&rft.au=Frisby,%20Trevor%20S&rft.date=2023-11-01&rft.volume=91&rft.issue=11&rft.spage=1471&rft.epage=1486&rft.pages=1471-1486&rft.issn=0887-3585&rft.eissn=1097-0134&rft_id=info:doi/10.1002/prot.26536&rft_dat=%3Cproquest_cross%3E2827921197%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2873283200&rft_id=info:pmid/37337902&rfr_iscdi=true