SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES...

Bibliographic details
Published in: Journal of chemical information and modeling 2021-04, Vol.61 (4), p.1560-1569
Main authors: Li, Xinhao, Fourches, Denis
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1569
container_issue 4
container_start_page 1560
container_title Journal of chemical information and modeling
container_volume 61
creator Li, Xinhao
Fourches, Denis
description Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, on which SPE consistently matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
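The two stages named in the description (learn a vocabulary of high-frequency SMILES substrings, then tokenize with it) can be illustrated with a byte-pair-encoding-style merge loop. The following is a minimal sketch under that assumption, not the SmilesPE package's code: the simplified atom-level regular expression, the function names, and the `min_frequency` parameter are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not the package's regex):
# bracket atoms, two-letter halogens, %NN ring closures, else one character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Stage 1: learn an ordered list of frequent adjacent token pairs."""
    words = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in words:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best, freq = pair_counts.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to become vocabulary entries
        merges.append(best)
        words = [merge_pair(tokens, best) for tokens in words]
    return merges


def spe_tokenize(smiles, merges):
    """Stage 2: tokenize by applying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


corpus = ["CCO", "CCN", "CCCl"]      # toy stand-in for a large dataset
merges = learn_merges(corpus, num_merges=5)
tokens = spe_tokenize("CCO", merges)
```

On this toy corpus only the pair `("C", "C")` reaches the frequency threshold, so `CCO` is split into the substring token `CC` followed by `O`. Joining the tokens always reproduces the input string, so the tokenization is lossless.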
doi_str_mv 10.1021/acs.jcim.0c01127
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2501474668</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2501474668</sourcerecordid><originalsourceid>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</originalsourceid><addsrcrecordid>eNp1kMFLwzAUh4Mobk7vniTgxYOdL02aNd7GNnUwUdgEPZU0TWfm2sykFfSvt3ObB8FTHuT7_d7jQ-iUQJdASK6k8t2FMkUXFBAS9vZQm0RMBILD8_5ujgRvoSPvFwCUCh4eohalPRJRTtroZXo_noym-FEah0elspkp59e4j4eyksHQmQ9d4mmd-srVqqqdxjP7pkvzJStjS9xfzq0z1WuBc-vwUOsVnmjpyqbkGB3kcun1yfbtoKeb0WxwF0webseD_iSQDHgVCBryVHGVMspTnmmeS5ZCrFSsBBGRkkKGEgjnDHQehYTGoFLCYsGb71hktIMuNr0rZ99r7aukMF7p5VKW2tY-CSMgrMc4jxv0_A-6sLUrm-saKmQiAojWFGwo5az3TufJyplCus-EQLLWnjTak7X2ZKu9iZxti-u00NlvYOe5AS43wE90t_Tfvm86zoxX</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2524950058</pqid></control><display><type>article</type><title>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</title><source>American Chemical Society Journals</source><creator>Li, Xinhao ; Fourches, Denis</creator><creatorcontrib>Li, Xinhao ; Fourches, Denis</creatorcontrib><description>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. 
Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.0c01127</identifier><identifier>PMID: 33715361</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Algorithms ; Datasets ; Deep learning ; Machine learning ; Machine Learning and Deep Learning ; Molecular structure ; Prediction models ; Training</subject><ispartof>Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569</ispartof><rights>2021 American Chemical Society</rights><rights>Copyright American Chemical Society Apr 26, 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</citedby><cites>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</cites><orcidid>0000-0001-5642-8303 ; 
0000-0002-1821-2680</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c01127$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acs.jcim.0c01127$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,780,784,2765,27076,27924,27925,56738,56788</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33715361$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Li, Xinhao</creatorcontrib><creatorcontrib>Fourches, Denis</creatorcontrib><title>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. 
The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</description><subject>Algorithms</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Machine learning</subject><subject>Machine Learning and Deep Learning</subject><subject>Molecular structure</subject><subject>Prediction models</subject><subject>Training</subject><issn>1549-9596</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp1kMFLwzAUh4Mobk7vniTgxYOdL02aNd7GNnUwUdgEPZU0TWfm2sykFfSvt3ObB8FTHuT7_d7jQ-iUQJdASK6k8t2FMkUXFBAS9vZQm0RMBILD8_5ujgRvoSPvFwCUCh4eohalPRJRTtroZXo_noym-FEah0elspkp59e4j4eyksHQmQ9d4mmd-srVqqqdxjP7pkvzJStjS9xfzq0z1WuBc-vwUOsVnmjpyqbkGB3kcun1yfbtoKeb0WxwF0webseD_iSQDHgVCBryVHGVMspTnmmeS5ZCrFSsBBGRkkKGEgjnDHQehYTGoFLCYsGb71hktIMuNr0rZ99r7aukMF7p5VKW2tY-CSMgrMc4jxv0_A-6sLUrm-saKmQiAojWFGwo5az3TufJyplCus-EQLLWnjTak7X2ZKu9iZxti-u00NlvYOe5AS43wE90t_Tfvm86zoxX</recordid><startdate>20210426</startdate><enddate>20210426</enddate><creator>Li, Xinhao</creator><creator>Fourches, Denis</creator><general>American Chemical Society</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-5642-8303</orcidid><orcidid>https://orcid.org/0000-0002-1821-2680</orcidid></search><sort><creationdate>20210426</creationdate><title>SMILES Pair Encoding: A Data-Driven 
Substructure Tokenization Algorithm for Deep Learning</title><author>Li, Xinhao ; Fourches, Denis</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Machine learning</topic><topic>Machine Learning and Deep Learning</topic><topic>Molecular structure</topic><topic>Prediction models</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Li, Xinhao</creatorcontrib><creatorcontrib>Fourches, Denis</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Li, Xinhao</au><au>Fourches, Denis</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. 
Model</addtitle><date>2021-04-26</date><risdate>2021</risdate><volume>61</volume><issue>4</issue><spage>1560</spage><epage>1569</epage><pages>1560-1569</pages><issn>1549-9596</issn><eissn>1549-960X</eissn><abstract>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>33715361</pmid><doi>10.1021/acs.jcim.0c01127</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0001-5642-8303</orcidid><orcidid>https://orcid.org/0000-0002-1821-2680</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1549-9596
ispartof Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569
issn 1549-9596
1549-960X
language eng
recordid cdi_proquest_miscellaneous_2501474668
source American Chemical Society Journals
subjects Algorithms
Datasets
Deep learning
Machine learning
Machine Learning and Deep Learning
Molecular structure
Prediction models
Training
title SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T15%3A10%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SMILES%20Pair%20Encoding:%20A%20Data-Driven%20Substructure%20Tokenization%20Algorithm%20for%20Deep%20Learning&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Li,%20Xinhao&rft.date=2021-04-26&rft.volume=61&rft.issue=4&rft.spage=1560&rft.epage=1569&rft.pages=1560-1569&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.0c01127&rft_dat=%3Cproquest_cross%3E2501474668%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2524950058&rft_id=info:pmid/33715361&rfr_iscdi=true