COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space

Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to drama...

Full description

Bibliographic details
Published in: Journal of chemical information and modeling 2024-02, Vol.64 (4), p.1145-1157
Main authors: Kaufman, Benjamin; Williams, Edward C.; Underkoffler, Carl; Pederson, Ryan; Mardirossian, Narbe; Watson, Ian; Parkhill, John
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1157
container_issue 4
container_start_page 1145
container_title Journal of chemical information and modeling
container_volume 64
creator Kaufman, Benjamin
Williams, Edward C.
Underkoffler, Carl
Pederson, Ryan
Mardirossian, Narbe
Watson, Ian
Parkhill, John
description Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to dramatically improve the drug discovery and design process. To create generative optimization methods that are more useful than brute-force molecular generation and filtering via virtual screening, we propose that four integrated features are necessary: large, quantitative data sets of molecular structure and activity, an invertible vector representation of realistic accessible molecules, smooth and differentiable regressors that quantify uncertainty, and algorithms to simultaneously optimize properties of interest. Over the course of 12 months, Terray Therapeutics has collected a data set of 2 billion quantitative binding measurements of small molecules to therapeutic targets, which directly motivates multiparameter generative optimization of molecules conditioned on these data. To this end, we present contrastive optimization for accelerated therapeutic inference (COATI), a pretrained, multimodal encoder-decoder model of druglike chemical space. COATI is constructed without any human biasing of features, using contrastive learning from text and 3D representations of molecules to allow for downstream use with structural models. We demonstrate that COATI possesses many of the desired properties of universal molecular embedding: fixed-dimension, invertibility, autoencoding, accurate regression, and low computation cost. Finally, we present a novel metadynamics algorithm for generative optimization using a small subset of our proprietary data collected for a model protein, carbonic anhydrase, designing molecules that satisfy the multiparameter optimization task of potency, solubility, and drug likeness. This work sets the stage for fully integrated generative molecular design and optimization for small molecules.
doi_str_mv 10.1021/acs.jcim.3c01753
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2922948614</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2922948614</sourcerecordid><originalsourceid>FETCH-LOGICAL-a364t-d4f9cabf291e36a1ce22d15fc0dbfa2737aec3cc1c753e9a196a77ca297b92473</originalsourceid><addsrcrecordid>eNp1kEtLAzEURoMotlb3rmTAjQtb85hJGndl8FGoVLRCd8Nt5o5OmUdNZgr-e1PbuhBc5Sac70tyCDlndMAoZzdg3GBp8nIgDGUqEgeky6JQ97Wk88P9HGnZISfOLSkVQkt-TDpiKJiUMuqSeTwdzca3wVNbNHlZp1AEcV01FlyTrzF4tujnvMqr9yCrbfCCK4sOq2ZzAFUazCys0brNNv7AMje-4HUFBk_JUQaFw7Pd2iNv93ez-LE_mT6M49GkD0KGTT8NM21gkXHNUEhgBjlPWZQZmi4y4EooQCOMYcZ_DzUwLUEpA1yrheahEj1yte1d2fqzRdckZe4MFgVUWLcu4ZpzHQ4lCz16-Qdd1q2t_Os8JXhEtVLcU3RLGVs7ZzFLVjYvwX4ljCYb64m3nmysJzvrPnKxK24XJaa_gb1mD1xvgZ_o_tJ_-74BYE2OWw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2932509772</pqid></control><display><type>article</type><title>COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space</title><source>American Chemical Society Journals</source><creator>Kaufman, Benjamin ; Williams, Edward C. ; Underkoffler, Carl ; Pederson, Ryan ; Mardirossian, Narbe ; Watson, Ian ; Parkhill, John</creator><creatorcontrib>Kaufman, Benjamin ; Williams, Edward C. ; Underkoffler, Carl ; Pederson, Ryan ; Mardirossian, Narbe ; Watson, Ian ; Parkhill, John</creatorcontrib><description>Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to dramatically improve the drug discovery and design process. 
To create generative optimization methods that are more useful than brute-force molecular generation and filtering via virtual screening, we propose that four integrated features are necessary: large, quantitative data sets of molecular structure and activity, an invertible vector representation of realistic accessible molecules, smooth and differentiable regressors that quantify uncertainty, and algorithms to simultaneously optimize properties of interest. Over the course of 12 months, Terray Therapeutics has collected a data set of 2 billion quantitative binding measurements of small molecules to therapeutic targets, which directly motivates multiparameter generative optimization of molecules conditioned on these data. To this end, we present contrastive optimization for accelerated therapeutic inference (COATI), a pretrained, multimodal encoder-decoder model of druglike chemical space. COATI is constructed without any human biasing of features, using contrastive learning from text and 3D representations of molecules to allow for downstream use with structural models. We demonstrate that COATI possesses many of the desired properties of universal molecular embedding: fixed-dimension, invertibility, autoencoding, accurate regression, and low computation cost. Finally, we present a novel metadynamics algorithm for generative optimization using a small subset of our proprietary data collected for a model protein, carbonic anhydrase, designing molecules that satisfy the multiparameter optimization task of potency, solubility, and drug likeness. 
This work sets the stage for fully integrated generative molecular design and optimization for small molecules.</description><identifier>ISSN: 1549-9596</identifier><identifier>ISSN: 1549-960X</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.3c01753</identifier><identifier>PMID: 38316665</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Algorithms ; Carbonic anhydrase ; Datasets ; Design optimization ; Encoders-Decoders ; Machine Learning and Deep Learning ; Molecular structure ; Optimization ; Representations ; Structural models</subject><ispartof>Journal of chemical information and modeling, 2024-02, Vol.64 (4), p.1145-1157</ispartof><rights>2024 American Chemical Society</rights><rights>Copyright American Chemical Society Feb 26, 2024</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a364t-d4f9cabf291e36a1ce22d15fc0dbfa2737aec3cc1c753e9a196a77ca297b92473</citedby><cites>FETCH-LOGICAL-a364t-d4f9cabf291e36a1ce22d15fc0dbfa2737aec3cc1c753e9a196a77ca297b92473</cites><orcidid>0000-0001-5739-9620</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c01753$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acs.jcim.3c01753$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,780,784,2765,27076,27924,27925,56738,56788</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38316665$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Kaufman, Benjamin</creatorcontrib><creatorcontrib>Williams, Edward C.</creatorcontrib><creatorcontrib>Underkoffler, Carl</creatorcontrib><creatorcontrib>Pederson, Ryan</creatorcontrib><creatorcontrib>Mardirossian, 
Narbe</creatorcontrib><creatorcontrib>Watson, Ian</creatorcontrib><creatorcontrib>Parkhill, John</creatorcontrib><title>COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to dramatically improve the drug discovery and design process. To create generative optimization methods that are more useful than brute-force molecular generation and filtering via virtual screening, we propose that four integrated features are necessary: large, quantitative data sets of molecular structure and activity, an invertible vector representation of realistic accessible molecules, smooth and differentiable regressors that quantify uncertainty, and algorithms to simultaneously optimize properties of interest. Over the course of 12 months, Terray Therapeutics has collected a data set of 2 billion quantitative binding measurements of small molecules to therapeutic targets, which directly motivates multiparameter generative optimization of molecules conditioned on these data. To this end, we present contrastive optimization for accelerated therapeutic inference (COATI), a pretrained, multimodal encoder-decoder model of druglike chemical space. COATI is constructed without any human biasing of features, using contrastive learning from text and 3D representations of molecules to allow for downstream use with structural models. We demonstrate that COATI possesses many of the desired properties of universal molecular embedding: fixed-dimension, invertibility, autoencoding, accurate regression, and low computation cost. 
Finally, we present a novel metadynamics algorithm for generative optimization using a small subset of our proprietary data collected for a model protein, carbonic anhydrase, designing molecules that satisfy the multiparameter optimization task of potency, solubility, and drug likeness. This work sets the stage for fully integrated generative molecular design and optimization for small molecules.</description><subject>Algorithms</subject><subject>Carbonic anhydrase</subject><subject>Datasets</subject><subject>Design optimization</subject><subject>Encoders-Decoders</subject><subject>Machine Learning and Deep Learning</subject><subject>Molecular structure</subject><subject>Optimization</subject><subject>Representations</subject><subject>Structural models</subject><issn>1549-9596</issn><issn>1549-960X</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNp1kEtLAzEURoMotlb3rmTAjQtb85hJGndl8FGoVLRCd8Nt5o5OmUdNZgr-e1PbuhBc5Sac70tyCDlndMAoZzdg3GBp8nIgDGUqEgeky6JQ97Wk88P9HGnZISfOLSkVQkt-TDpiKJiUMuqSeTwdzca3wVNbNHlZp1AEcV01FlyTrzF4tujnvMqr9yCrbfCCK4sOq2ZzAFUazCys0brNNv7AMje-4HUFBk_JUQaFw7Pd2iNv93ez-LE_mT6M49GkD0KGTT8NM21gkXHNUEhgBjlPWZQZmi4y4EooQCOMYcZ_DzUwLUEpA1yrheahEj1yte1d2fqzRdckZe4MFgVUWLcu4ZpzHQ4lCz16-Qdd1q2t_Os8JXhEtVLcU3RLGVs7ZzFLVjYvwX4ljCYb64m3nmysJzvrPnKxK24XJaa_gb1mD1xvgZ_o_tJ_-74BYE2OWw</recordid><startdate>20240226</startdate><enddate>20240226</enddate><creator>Kaufman, Benjamin</creator><creator>Williams, Edward C.</creator><creator>Underkoffler, Carl</creator><creator>Pederson, Ryan</creator><creator>Mardirossian, Narbe</creator><creator>Watson, Ian</creator><creator>Parkhill, John</creator><general>American Chemical 
Society</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-5739-9620</orcidid></search><sort><creationdate>20240226</creationdate><title>COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space</title><author>Kaufman, Benjamin ; Williams, Edward C. ; Underkoffler, Carl ; Pederson, Ryan ; Mardirossian, Narbe ; Watson, Ian ; Parkhill, John</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a364t-d4f9cabf291e36a1ce22d15fc0dbfa2737aec3cc1c753e9a196a77ca297b92473</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Carbonic anhydrase</topic><topic>Datasets</topic><topic>Design optimization</topic><topic>Encoders-Decoders</topic><topic>Machine Learning and Deep Learning</topic><topic>Molecular structure</topic><topic>Optimization</topic><topic>Representations</topic><topic>Structural models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kaufman, Benjamin</creatorcontrib><creatorcontrib>Williams, Edward C.</creatorcontrib><creatorcontrib>Underkoffler, Carl</creatorcontrib><creatorcontrib>Pederson, Ryan</creatorcontrib><creatorcontrib>Mardirossian, Narbe</creatorcontrib><creatorcontrib>Watson, Ian</creatorcontrib><creatorcontrib>Parkhill, John</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials 
Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kaufman, Benjamin</au><au>Williams, Edward C.</au><au>Underkoffler, Carl</au><au>Pederson, Ryan</au><au>Mardirossian, Narbe</au><au>Watson, Ian</au><au>Parkhill, John</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. Model</addtitle><date>2024-02-26</date><risdate>2024</risdate><volume>64</volume><issue>4</issue><spage>1145</spage><epage>1157</epage><pages>1145-1157</pages><issn>1549-9596</issn><issn>1549-960X</issn><eissn>1549-960X</eissn><abstract>Creating a successful small molecule drug is a challenging multiparameter optimization problem in an effectively infinite space of possible molecules. Generative models have emerged as powerful tools for traversing data manifolds composed of images, sounds, and text and offer an opportunity to dramatically improve the drug discovery and design process. 
To create generative optimization methods that are more useful than brute-force molecular generation and filtering via virtual screening, we propose that four integrated features are necessary: large, quantitative data sets of molecular structure and activity, an invertible vector representation of realistic accessible molecules, smooth and differentiable regressors that quantify uncertainty, and algorithms to simultaneously optimize properties of interest. Over the course of 12 months, Terray Therapeutics has collected a data set of 2 billion quantitative binding measurements of small molecules to therapeutic targets, which directly motivates multiparameter generative optimization of molecules conditioned on these data. To this end, we present contrastive optimization for accelerated therapeutic inference (COATI), a pretrained, multimodal encoder-decoder model of druglike chemical space. COATI is constructed without any human biasing of features, using contrastive learning from text and 3D representations of molecules to allow for downstream use with structural models. We demonstrate that COATI possesses many of the desired properties of universal molecular embedding: fixed-dimension, invertibility, autoencoding, accurate regression, and low computation cost. Finally, we present a novel metadynamics algorithm for generative optimization using a small subset of our proprietary data collected for a model protein, carbonic anhydrase, designing molecules that satisfy the multiparameter optimization task of potency, solubility, and drug likeness. This work sets the stage for fully integrated generative molecular design and optimization for small molecules.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>38316665</pmid><doi>10.1021/acs.jcim.3c01753</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0001-5739-9620</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 1549-9596
ispartof Journal of chemical information and modeling, 2024-02, Vol.64 (4), p.1145-1157
issn 1549-9596
1549-960X
language eng
recordid cdi_proquest_miscellaneous_2922948614
source American Chemical Society Journals
subjects Algorithms
Carbonic anhydrase
Datasets
Design optimization
Encoders-Decoders
Machine Learning and Deep Learning
Molecular structure
Optimization
Representations
Structural models
title COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T04%3A06%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=COATI:%20Multimodal%20Contrastive%20Pretraining%20for%20Representing%20and%20Traversing%20Chemical%20Space&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Kaufman,%20Benjamin&rft.date=2024-02-26&rft.volume=64&rft.issue=4&rft.spage=1145&rft.epage=1157&rft.pages=1145-1157&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.3c01753&rft_dat=%3Cproquest_cross%3E2922948614%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2932509772&rft_id=info:pmid/38316665&rfr_iscdi=true