SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES...

Bibliographic Details
Published in:	Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569
Main Authors:	Li, Xinhao; Fourches, Denis
Format:	Article
Language:	eng
Subjects:	Algorithms; Datasets; Deep learning; Machine learning; Machine Learning and Deep Learning; Molecular structure; Prediction models; Training
Online Access:	Full text

container_end_page	1569
container_issue	4
container_start_page	1560
container_title	Journal of chemical information and modeling
container_volume	61
creator	Li, Xinhao ; Fourches, Denis
description	Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
doi_str_mv	10.1021/acs.jcim.0c01127
format	Article
addtitle	J. Chem. Inf. Model
date	2021-04-26
publisher	United States: American Chemical Society
rights	2021 American Chemical Society ; Copyright American Chemical Society Apr 26, 2021
pmid	33715361
eissn	1549-960X
tpages	10
orcid	https://orcid.org/0000-0001-5642-8303 ; https://orcid.org/0000-0002-1821-2680
oa	free_for_read
toplevel	peer_reviewed ; online_resources
sourceid	proquest_cross
sourcerecordid	2501474668
pqid	2524950058
linktopdf	https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c01127
linktohtml	https://pubs.acs.org/doi/10.1021/acs.jcim.0c01127
backlink	https://www.ncbi.nlm.nih.gov/pubmed/33715361
collection	PubMed ; CrossRef ; Computer and Information Systems Abstracts ; Engineered Materials Abstracts ; Solid State and Superconductivity Abstracts ; METADEX ; Technology Research Database ; Materials Research Database ; ProQuest Computer Science Collection ; Advanced Technologies Database with Aerospace ; Computer and Information Systems Abstracts Academic ; Computer and Information Systems Abstracts Professional ; MEDLINE - Academic
fulltext	fulltext
identifier	ISSN: 1549-9596
ispartof	Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569
issn	1549-9596 1549-960X
language	eng
recordid	cdi_proquest_miscellaneous_2501474668
source	American Chemical Society Journals
subjects	Algorithms ; Datasets ; Deep learning ; Machine learning ; Machine Learning and Deep Learning ; Molecular structure ; Prediction models ; Training
title	SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T15%3A10%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SMILES%20Pair%20Encoding:%20A%20Data-Driven%20Substructure%20Tokenization%20Algorithm%20for%20Deep%20Learning&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Li,%20Xinhao&rft.date=2021-04-26&rft.volume=61&rft.issue=4&rft.spage=1560&rft.epage=1569&rft.pages=1560-1569&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.0c01127&rft_dat=%3Cproquest_cross%3E2501474668%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2524950058&rft_id=info:pmid/33715361&rfr_iscdi=true
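
Note on the description field: the abstract describes SPE as two stages, first learning a vocabulary of high-frequency SMILES substrings from a large dataset, then tokenizing SMILES with that vocabulary on top of atom-level tokens. The record does not spell out the procedure, so the following is only a minimal sketch that assumes a byte-pair-encoding-style merge loop. It is not the SmilesPE package or its API; the regular expression, the function names (atom_tokenize, learn_merges, apply_merge, spe_tokenize), the min_frequency parameter, and the toy corpus are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption for this sketch, not the
# tokenizer shipped with SmilesPE): bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def apply_merge(tokens, pair):
    """Fuse every non-overlapping adjacent occurrence of `pair`, left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of token-pair merges from a SMILES corpus.

    Each round counts adjacent token pairs and merges the most frequent one.
    Learning stops after `num_merges` rounds or when the best pair occurs
    fewer than `min_frequency` times (hypothetical stopping rule).
    """
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < min_frequency:
            break
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize one SMILES string by replaying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Toy corpus for illustration only; the paper refers to datasets such as ChEMBL.
    toy_corpus = ["CC(=O)O", "CC(=O)OC", "CCO", "c1ccccc1C(=O)O"]
    learned = learn_merges(toy_corpus, num_merges=5)
    print(learned)
    print(spe_tokenize("CC(=O)OCC", learned))
```

Replaying the merges in the order they were learned keeps tokenization deterministic, and because every merge only concatenates neighbouring tokens, joining the output tokens always reproduces the input string.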