SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES...
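Published in: | Journal of chemical information and modeling 2021-04, Vol.61 (4), p.1560-1569 |
---|---|
Main authors: | Li, Xinhao; Fourches, Denis |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Datasets; Deep learning; Machine learning; Machine Learning and Deep Learning; Molecular structure; Prediction models; Training |
Online access: | Full text |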
container_end_page | 1569 |
---|---|
container_issue | 4 |
container_start_page | 1560 |
container_title | Journal of chemical information and modeling |
container_volume | 61 |
creator | Li, Xinhao; Fourches, Denis |
description | Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE. |
doi_str_mv | 10.1021/acs.jcim.0c01127 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2501474668</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2501474668</sourcerecordid><originalsourceid>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</originalsourceid><addsrcrecordid>eNp1kMFLwzAUh4Mobk7vniTgxYOdL02aNd7GNnUwUdgEPZU0TWfm2sykFfSvt3ObB8FTHuT7_d7jQ-iUQJdASK6k8t2FMkUXFBAS9vZQm0RMBILD8_5ujgRvoSPvFwCUCh4eohalPRJRTtroZXo_noym-FEah0elspkp59e4j4eyksHQmQ9d4mmd-srVqqqdxjP7pkvzJStjS9xfzq0z1WuBc-vwUOsVnmjpyqbkGB3kcun1yfbtoKeb0WxwF0webseD_iSQDHgVCBryVHGVMspTnmmeS5ZCrFSsBBGRkkKGEgjnDHQehYTGoFLCYsGb71hktIMuNr0rZ99r7aukMF7p5VKW2tY-CSMgrMc4jxv0_A-6sLUrm-saKmQiAojWFGwo5az3TufJyplCus-EQLLWnjTak7X2ZKu9iZxti-u00NlvYOe5AS43wE90t_Tfvm86zoxX</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2524950058</pqid></control><display><type>article</type><title>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</title><source>American Chemical Society Journals</source><creator>Li, Xinhao ; Fourches, Denis</creator><creatorcontrib>Li, Xinhao ; Fourches, Denis</creatorcontrib><description>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. 
Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.0c01127</identifier><identifier>PMID: 33715361</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Algorithms ; Datasets ; Deep learning ; Machine learning ; Machine Learning and Deep Learning ; Molecular structure ; Prediction models ; Training</subject><ispartof>Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569</ispartof><rights>2021 American Chemical Society</rights><rights>Copyright American Chemical Society Apr 26, 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</citedby><cites>FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</cites><orcidid>0000-0001-5642-8303 ; 
0000-0002-1821-2680</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c01127$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acs.jcim.0c01127$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,780,784,2765,27076,27924,27925,56738,56788</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33715361$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Li, Xinhao</creatorcontrib><creatorcontrib>Fourches, Denis</creatorcontrib><title>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. 
The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</description><subject>Algorithms</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Machine learning</subject><subject>Machine Learning and Deep Learning</subject><subject>Molecular structure</subject><subject>Prediction models</subject><subject>Training</subject><issn>1549-9596</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp1kMFLwzAUh4Mobk7vniTgxYOdL02aNd7GNnUwUdgEPZU0TWfm2sykFfSvt3ObB8FTHuT7_d7jQ-iUQJdASK6k8t2FMkUXFBAS9vZQm0RMBILD8_5ujgRvoSPvFwCUCh4eohalPRJRTtroZXo_noym-FEah0elspkp59e4j4eyksHQmQ9d4mmd-srVqqqdxjP7pkvzJStjS9xfzq0z1WuBc-vwUOsVnmjpyqbkGB3kcun1yfbtoKeb0WxwF0webseD_iSQDHgVCBryVHGVMspTnmmeS5ZCrFSsBBGRkkKGEgjnDHQehYTGoFLCYsGb71hktIMuNr0rZ99r7aukMF7p5VKW2tY-CSMgrMc4jxv0_A-6sLUrm-saKmQiAojWFGwo5az3TufJyplCus-EQLLWnjTak7X2ZKu9iZxti-u00NlvYOe5AS43wE90t_Tfvm86zoxX</recordid><startdate>20210426</startdate><enddate>20210426</enddate><creator>Li, Xinhao</creator><creator>Fourches, Denis</creator><general>American Chemical Society</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-5642-8303</orcidid><orcidid>https://orcid.org/0000-0002-1821-2680</orcidid></search><sort><creationdate>20210426</creationdate><title>SMILES Pair Encoding: A Data-Driven 
Substructure Tokenization Algorithm for Deep Learning</title><author>Li, Xinhao ; Fourches, Denis</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a406t-9326bc6cb436b6de6fa4b08cc8c9195ca9a2a016640ef521380cb1489691989d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Machine learning</topic><topic>Machine Learning and Deep Learning</topic><topic>Molecular structure</topic><topic>Prediction models</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Li, Xinhao</creatorcontrib><creatorcontrib>Fourches, Denis</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Li, Xinhao</au><au>Fourches, Denis</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. 
Model</addtitle><date>2021-04-26</date><risdate>2021</risdate><volume>61</volume><issue>4</issue><spage>1560</spage><epage>1569</epage><pages>1560-1569</pages><issn>1549-9596</issn><eissn>1549-960X</eissn><abstract>Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>33715361</pmid><doi>10.1021/acs.jcim.0c01127</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0001-5642-8303</orcidid><orcidid>https://orcid.org/0000-0002-1821-2680</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1549-9596 |
ispartof | Journal of chemical information and modeling, 2021-04, Vol.61 (4), p.1560-1569 |
issn | 1549-9596; 1549-960X |
language | eng |
recordid | cdi_proquest_miscellaneous_2501474668 |
source | American Chemical Society Journals |
subjects | Algorithms; Datasets; Deep learning; Machine learning; Machine Learning and Deep Learning; Molecular structure; Prediction models; Training |
title | SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T15%3A10%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SMILES%20Pair%20Encoding:%20A%20Data-Driven%20Substructure%20Tokenization%20Algorithm%20for%20Deep%20Learning&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Li,%20Xinhao&rft.date=2021-04-26&rft.volume=61&rft.issue=4&rft.spage=1560&rft.epage=1569&rft.pages=1560-1569&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.0c01127&rft_dat=%3Cproquest_cross%3E2501474668%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2524950058&rft_id=info:pmid/33715361&rfr_iscdi=true |
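
The description field above presents SPE as two stages: learn a vocabulary of high-frequency SMILES substrings from a corpus, then tokenize new SMILES with that vocabulary. The sketch below illustrates that idea under stated assumptions and is not the SmilesPE package's code or API. It assumes a byte-pair-encoding-style procedure that starts from atom-level tokens and repeatedly merges the most frequent adjacent pair. The names `ATOM_PATTERN`, `atom_tokenize`, `merge_pair`, `learn_merges`, and `spe_tokenize`, the `min_frequency` parameter, and the toy corpus are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: a commonly used atom-level SMILES pattern (bracket atoms,
# two-letter halogens, ring-closure labels, bonds, branches). It is not
# taken from the catalog record and may not cover every SMILES feature.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#$/\\\-+().:~@?>*]|\d)"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each non-overlapping adjacent occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Stage 1 (hypothetical): learn an ordered list of frequent pair merges."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))  # count adjacent token pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to become vocabulary entries
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Stage 2 (hypothetical): tokenize by replaying the merges in learned order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Toy corpus for illustration only; the abstract mentions ChEMBL as a real source.
    toy_corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)Cl", "c1ccccc1O"]
    learned = learn_merges(toy_corpus, num_merges=5)
    print(learned)
    print(spe_tokenize("CC(=O)Nc1ccccc1", learned))
```

Replaying merges in the order they were learned keeps tokenization consistent with training, and concatenating the output tokens always reproduces the input string, so the atom-level view can be recovered from any learned vocabulary.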