A primer on model-guided exploration of fitness landscapes for biological sequence design

Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies rem...

Bibliographic details
Published in: arXiv.org 2020-10
Main authors: Sinai, Sam; Kelsic, Eric D
Format: Article
Language: eng
Online access: Full text
container_title arXiv.org
creator Sinai, Sam
Kelsic, Eric D
description Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such a (often rare) sequence within this space. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties that we like using these models. In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2453523423</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2453523423</sourcerecordid><originalsourceid>FETCH-proquest_journals_24535234233</originalsourceid><addsrcrecordid>eNqNjkEKwjAQAIMgWLR_WPBcqEmjXkUUH-DFk8RmU1LSbMy24PPtwQd4msPMYRaikErtqmMj5UqUzH1d13J_kFqrQjxOkLIfMANFGMhiqLrJW7SAnxQom9HPghw4P0ZkhmCi5dYkZHCU4eUpUOdbE4DxPWFsESyy7-JGLJ0JjOWPa7G9Xu7nW5UyzSGPz56mHGf1lI1WWqpm_vyv-gIPqEKw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2453523423</pqid></control><display><type>article</type><title>A primer on model-guided exploration of fitness landscapes for biological sequence design</title><source>Freely Accessible Journals</source><creator>Sinai, Sam ; Kelsic, Eric D</creator><creatorcontrib>Sinai, Sam ; Kelsic, Eric D</creatorcontrib><description>Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such a (often rare) sequence within this space. 
Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties that we like using these models. In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Bioengineering ; Biological effects ; Biologists ; Design of experiments ; Exploration ; Fitness ; Interpolation ; Landscape design ; Machine learning ; Prediction models ; Search process</subject><ispartof>arXiv.org, 2020-10</ispartof><rights>2020. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Sinai, Sam</creatorcontrib><creatorcontrib>Kelsic, Eric D</creatorcontrib><title>A primer on model-guided exploration of fitness landscapes for biological sequence design</title><title>arXiv.org</title><description>Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such a (often rare) sequence within this space. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties that we like using these models. 
In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.</description><subject>Algorithms</subject><subject>Bioengineering</subject><subject>Biological effects</subject><subject>Biologists</subject><subject>Design of experiments</subject><subject>Exploration</subject><subject>Fitness</subject><subject>Interpolation</subject><subject>Landscape design</subject><subject>Machine learning</subject><subject>Prediction models</subject><subject>Search process</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjkEKwjAQAIMgWLR_WPBcqEmjXkUUH-DFk8RmU1LSbMy24PPtwQd4msPMYRaikErtqmMj5UqUzH1d13J_kFqrQjxOkLIfMANFGMhiqLrJW7SAnxQom9HPghw4P0ZkhmCi5dYkZHCU4eUpUOdbE4DxPWFsESyy7-JGLJ0JjOWPa7G9Xu7nW5UyzSGPz56mHGf1lI1WWqpm_vyv-gIPqEKw</recordid><startdate>20201023</startdate><enddate>20201023</enddate><creator>Sinai, Sam</creator><creator>Kelsic, Eric D</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20201023</creationdate><title>A primer on model-guided exploration of fitness landscapes for biological sequence design</title><author>Sinai, Sam ; Kelsic, Eric D</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24535234233</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Bioengineering</topic><topic>Biological effects</topic><topic>Biologists</topic><topic>Design of experiments</topic><topic>Exploration</topic><topic>Fitness</topic><topic>Interpolation</topic><topic>Landscape design</topic><topic>Machine learning</topic><topic>Prediction models</topic><topic>Search process</topic><toplevel>online_resources</toplevel><creatorcontrib>Sinai, Sam</creatorcontrib><creatorcontrib>Kelsic, Eric D</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available 
Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Sinai, Sam</au><au>Kelsic, Eric D</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A primer on model-guided exploration of fitness landscapes for biological sequence design</atitle><jtitle>arXiv.org</jtitle><date>2020-10-23</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such a (often rare) sequence within this space. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties that we like using these models. 
In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_2453523423
source Freely Accessible Journals
subjects Algorithms
Bioengineering
Biological effects
Biologists
Design of experiments
Exploration
Fitness
Interpolation
Landscape design
Machine learning
Prediction models
Search process
title A primer on model-guided exploration of fitness landscapes for biological sequence design
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T19%3A23%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20primer%20on%20model-guided%20exploration%20of%20fitness%20landscapes%20for%20biological%20sequence%20design&rft.jtitle=arXiv.org&rft.au=Sinai,%20Sam&rft.date=2020-10-23&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2453523423%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2453523423&rft_id=info:pmid/&rfr_iscdi=true