SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications

We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector with defined sizes and value...

Bibliographic details
Published in: Physical chemistry chemical physics : PCCP 2023-07, Vol.25 (27), p.1838-1847
Main authors: Dias-Silva, Jefferson R, Oliveira, Vitor M, Sanches-Neto, Flávio O, Wilhelms, Renan Z, Queiroz Júnior, Luiz H. K
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1847
container_issue 27
container_start_page 1838
container_title Physical chemistry chemical physics : PCCP
container_volume 25
creator Dias-Silva, Jefferson R
Oliveira, Vitor M
Sanches-Neto, Flávio O
Wilhelms, Renan Z
Queiroz Júnior, Luiz H. K
description We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector of defined size with values of 0 and 1, with the ability to correct chemical shift fluctuations. To explore the applicability of SpectraFP, we outlined two application scenarios: (1) the prediction of six functional groups by machine learning (ML) models and (2) the search for structures based on the similarity between the query spectrum and spectra in an experimental database, both in the SpectraFP format. For each functional group, five ML models were built and validated following the OECD principles: internal and external validations, applicability domains, and mechanistic interpretations. All the models showed high goodness-of-fit, with MCC between 0.626 and 0.909 for the training sets and between 0.653 and 0.917 for the test sets, and J ranging from 0.812 to 0.957 (training) and from 0.825 to 0.961 (test). Using the SHAP (SHapley Additive exPlanations) approach, the mechanistic interpretations of the models were explored; the results indicated that the most important variables for model decision making were coherent with the expected chemical shifts for each functional group. Several metrics, including Tanimoto, geometric, arithmetic, and Tversky, can be used to perform the similarity calculation for the search algorithm. This algorithm can also incorporate additional variables, such as the correction parameter and the difference between the number of signals in the query spectrum and the database spectra, while preserving its high speed. We hope that our descriptor can link information from spectroscopic/spectrometric techniques with ML models to expand the possibilities in understanding the field of cheminformatics. All databases and algorithms developed for this work are open source and freely accessible.
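The description above mentions a 0/1 fingerprint vector with a correction for chemical shift fluctuations, and a Tanimoto similarity search. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the 0 to 240 ppm range, the 1 ppm bin width, and the reading of the correction parameter as "also set neighbouring bits" are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def spectra_fp(shifts_ppm: Iterable[float],
               start: float = 0.0,    # assumed lower bound of the ppm axis
               stop: float = 240.0,   # assumed upper bound of the ppm axis
               step: float = 1.0,     # assumed bin width in ppm
               correction: int = 0) -> List[int]:
    """Hypothetical binary fingerprint: bit i is 1 if a peak falls in bin i.

    `correction` additionally sets that many neighbouring bits on each side
    of a peak, which tolerates small shift fluctuations (an assumption about
    how such a correction could work).
    """
    n_bits = int(round((stop - start) / step))
    bits = [0] * n_bits
    for shift in shifts_ppm:
        if not (start <= shift < stop):
            continue  # ignore peaks outside the covered range
        idx = min(int((shift - start) // step), n_bits - 1)
        low = max(0, idx - correction)
        high = min(n_bits - 1, idx + correction)
        for i in range(low, high + 1):
            bits[i] = 1
    return bits


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Illustrative peak lists (made-up values, not taken from any database).
query = spectra_fp([14.1, 22.7, 31.9])
reference = spectra_fp([14.3, 22.9, 60.2])
similarity = tanimoto(query, reference)
```

In this sketch, two spectra count as similar when their peaks fall into the same bins; widening `correction` trades specificity for tolerance to small shift differences.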
doi_str_mv 10.1039/d3cp00734k
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1039_D3CP00734K</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2830667662</sourcerecordid><originalsourceid>FETCH-LOGICAL-c337t-b773315f4928a2c2da8184d6c01fde9af9589ee4ebc559ee59e4963e600a3743</originalsourceid><addsrcrecordid>eNpd0U1r3DAQBmBRGpqP5tJ7i6CXEOJW8tiS3VvZNh8k0EByN7PSuKvUthzJJjS3_PMo2XQLOQgNeh8GwcvYBym-SAH1VwtmFEJD8ecN25GFgqwWVfF2M2u1zXZjvBFCyFLCO7YNGnSllNxhD1cjmSng8eU3jnygOx7XD9kSI1luKZrgxskHPnmOznI3cLOi3g2tDz1OzsQj3vuOzNxhSBEGNBMFd58yP3AcLI-Ewaw4dr99cNOq5ziOnTPPIL5nWy12kfZf7j12ffzzenGaXfw6OVt8v8gMgJ6ypdYAsmyLOq8wN7nFSlaFVUbI1lKNbV1WNVFBS1OWaUinqBWQEgJBF7DHDtZrx-BvZ4pT07toqOtwID_HJq9AKKWVyhP9_Ire-DkM6XNPqqxyIaFM6nCtTPAxBmqbMbgew99Giuapl-YHLC6fezlP-NPLynnZk93Qf0Uk8HENQjSb9H-x8AirzZOv</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2835820135</pqid></control><display><type>article</type><title>SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications</title><source>Royal Society Of Chemistry Journals</source><source>Alma/SFX Local Collection</source><creator>Dias-Silva, Jefferson R ; Oliveira, Vitor M ; Sanches-Neto, Flávio O ; Wilhelms, Renan Z ; Queiroz Júnior, Luiz H. K</creator><creatorcontrib>Dias-Silva, Jefferson R ; Oliveira, Vitor M ; Sanches-Neto, Flávio O ; Wilhelms, Renan Z ; Queiroz Júnior, Luiz H. K</creatorcontrib><description>We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector with defined sizes and values of 0 and 1, with the ability to correct chemical shift fluctuations. 
To explore the applicability of SpectraFP, we outlined two application scenarios: (1) the prediction of six functional groups by machine learning (ML) models and (2) the search for structures based on the similarity between the query spectrum and spectra in an experimental database, both in the SpectraFP format. For each functional group, five ML models were built and validated following the OECD principles: internal and external validations, applicability domains, and mechanistic interpretations. All the models resulted in high goodness-of-fit for the training and test sets with MCC respectively between 0.626 and 0.909 and 0.653 and 0.917, and J ranging from 0.812 to 0.957 and 0.825 to 0.961. Using the SHAP (SHapley Additive exPlanations) approach, the mechanistic interpretations of the models were explored; the results indicated that the most important variables for model decision making were coherent with the expected chemical shifts for each functional group. Several metrics, including Tanimoto, geometric, arithmetic, and Tversky, can be used to perform the similarity calculation for the search algorithm. This algorithm can also incorporate additional variables, such as the correction parameter and the difference between the amount of signals in the query spectrum and the database spectra, while preserving its high performance speed. We hope that our descriptor can link information from spectroscopic/spectrometric techniques with ML models to expand the possibilities in understanding the field of cheminformatics. All databases and algorithms developed for this work are open sources and freely accessible. 
We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques.</description><identifier>ISSN: 1463-9076</identifier><identifier>EISSN: 1463-9084</identifier><identifier>DOI: 10.1039/d3cp00734k</identifier><identifier>PMID: 37378661</identifier><language>eng</language><publisher>England: Royal Society of Chemistry</publisher><subject>Algorithms ; Chemical equilibrium ; Decision making ; Functional groups ; Goodness of fit ; Machine learning ; NMR ; Nuclear magnetic resonance ; Search algorithms ; Similarity ; Spectra ; Spectrometry ; Spectroscopy</subject><ispartof>Physical chemistry chemical physics : PCCP, 2023-07, Vol.25 (27), p.1838-1847</ispartof><rights>Copyright Royal Society of Chemistry 2023</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c337t-b773315f4928a2c2da8184d6c01fde9af9589ee4ebc559ee59e4963e600a3743</citedby><cites>FETCH-LOGICAL-c337t-b773315f4928a2c2da8184d6c01fde9af9589ee4ebc559ee59e4963e600a3743</cites><orcidid>0000-0002-0399-4915 ; 0000-0003-1706-5751 ; 0000-0002-0664-171X ; 0000-0001-9049-1533 ; 0000-0003-2627-6067</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37378661$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Dias-Silva, Jefferson R</creatorcontrib><creatorcontrib>Oliveira, Vitor M</creatorcontrib><creatorcontrib>Sanches-Neto, Flávio O</creatorcontrib><creatorcontrib>Wilhelms, Renan Z</creatorcontrib><creatorcontrib>Queiroz Júnior, Luiz H. 
K</creatorcontrib><title>SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications</title><title>Physical chemistry chemical physics : PCCP</title><addtitle>Phys Chem Chem Phys</addtitle><description>We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector with defined sizes and values of 0 and 1, with the ability to correct chemical shift fluctuations. To explore the applicability of SpectraFP, we outlined two application scenarios: (1) the prediction of six functional groups by machine learning (ML) models and (2) the search for structures based on the similarity between the query spectrum and spectra in an experimental database, both in the SpectraFP format. For each functional group, five ML models were built and validated following the OECD principles: internal and external validations, applicability domains, and mechanistic interpretations. All the models resulted in high goodness-of-fit for the training and test sets with MCC respectively between 0.626 and 0.909 and 0.653 and 0.917, and J ranging from 0.812 to 0.957 and 0.825 to 0.961. Using the SHAP (SHapley Additive exPlanations) approach, the mechanistic interpretations of the models were explored; the results indicated that the most important variables for model decision making were coherent with the expected chemical shifts for each functional group. Several metrics, including Tanimoto, geometric, arithmetic, and Tversky, can be used to perform the similarity calculation for the search algorithm. 
This algorithm can also incorporate additional variables, such as the correction parameter and the difference between the amount of signals in the query spectrum and the database spectra, while preserving its high performance speed. We hope that our descriptor can link information from spectroscopic/spectrometric techniques with ML models to expand the possibilities in understanding the field of cheminformatics. All databases and algorithms developed for this work are open sources and freely accessible. We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques.</description><subject>Algorithms</subject><subject>Chemical equilibrium</subject><subject>Decision making</subject><subject>Functional groups</subject><subject>Goodness of fit</subject><subject>Machine learning</subject><subject>NMR</subject><subject>Nuclear magnetic resonance</subject><subject>Search algorithms</subject><subject>Similarity</subject><subject>Spectra</subject><subject>Spectrometry</subject><subject>Spectroscopy</subject><issn>1463-9076</issn><issn>1463-9084</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNpd0U1r3DAQBmBRGpqP5tJ7i6CXEOJW8tiS3VvZNh8k0EByN7PSuKvUthzJJjS3_PMo2XQLOQgNeh8GwcvYBym-SAH1VwtmFEJD8ecN25GFgqwWVfF2M2u1zXZjvBFCyFLCO7YNGnSllNxhD1cjmSng8eU3jnygOx7XD9kSI1luKZrgxskHPnmOznI3cLOi3g2tDz1OzsQj3vuOzNxhSBEGNBMFd58yP3AcLI-Ewaw4dr99cNOq5ziOnTPPIL5nWy12kfZf7j12ffzzenGaXfw6OVt8v8gMgJ6ypdYAsmyLOq8wN7nFSlaFVUbI1lKNbV1WNVFBS1OWaUinqBWQEgJBF7DHDtZrx-BvZ4pT07toqOtwID_HJq9AKKWVyhP9_Ire-DkM6XNPqqxyIaFM6nCtTPAxBmqbMbgew99Giuapl-YHLC6fezlP-NPLynnZk93Qf0Uk8HENQjSb9H-x8AirzZOv</recordid><startdate>20230712</startdate><enddate>20230712</enddate><creator>Dias-Silva, Jefferson R</creator><creator>Oliveira, Vitor M</creator><creator>Sanches-Neto, 
Flávio O</creator><creator>Wilhelms, Renan Z</creator><creator>Queiroz Júnior, Luiz H. K</creator><general>Royal Society of Chemistry</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>L7M</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-0399-4915</orcidid><orcidid>https://orcid.org/0000-0003-1706-5751</orcidid><orcidid>https://orcid.org/0000-0002-0664-171X</orcidid><orcidid>https://orcid.org/0000-0001-9049-1533</orcidid><orcidid>https://orcid.org/0000-0003-2627-6067</orcidid></search><sort><creationdate>20230712</creationdate><title>SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications</title><author>Dias-Silva, Jefferson R ; Oliveira, Vitor M ; Sanches-Neto, Flávio O ; Wilhelms, Renan Z ; Queiroz Júnior, Luiz H. K</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c337t-b773315f4928a2c2da8184d6c01fde9af9589ee4ebc559ee59e4963e600a3743</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Chemical equilibrium</topic><topic>Decision making</topic><topic>Functional groups</topic><topic>Goodness of fit</topic><topic>Machine learning</topic><topic>NMR</topic><topic>Nuclear magnetic resonance</topic><topic>Search algorithms</topic><topic>Similarity</topic><topic>Spectra</topic><topic>Spectrometry</topic><topic>Spectroscopy</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Dias-Silva, Jefferson R</creatorcontrib><creatorcontrib>Oliveira, Vitor M</creatorcontrib><creatorcontrib>Sanches-Neto, Flávio O</creatorcontrib><creatorcontrib>Wilhelms, Renan Z</creatorcontrib><creatorcontrib>Queiroz Júnior, Luiz H. 
K</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>MEDLINE - Academic</collection><jtitle>Physical chemistry chemical physics : PCCP</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Dias-Silva, Jefferson R</au><au>Oliveira, Vitor M</au><au>Sanches-Neto, Flávio O</au><au>Wilhelms, Renan Z</au><au>Queiroz Júnior, Luiz H. K</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications</atitle><jtitle>Physical chemistry chemical physics : PCCP</jtitle><addtitle>Phys Chem Chem Phys</addtitle><date>2023-07-12</date><risdate>2023</risdate><volume>25</volume><issue>27</issue><spage>1838</spage><epage>1847</epage><pages>1838-1847</pages><issn>1463-9076</issn><eissn>1463-9084</eissn><abstract>We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques. This descriptor is a fingerprint vector with defined sizes and values of 0 and 1, with the ability to correct chemical shift fluctuations. To explore the applicability of SpectraFP, we outlined two application scenarios: (1) the prediction of six functional groups by machine learning (ML) models and (2) the search for structures based on the similarity between the query spectrum and spectra in an experimental database, both in the SpectraFP format. 
For each functional group, five ML models were built and validated following the OECD principles: internal and external validations, applicability domains, and mechanistic interpretations. All the models resulted in high goodness-of-fit for the training and test sets with MCC respectively between 0.626 and 0.909 and 0.653 and 0.917, and J ranging from 0.812 to 0.957 and 0.825 to 0.961. Using the SHAP (SHapley Additive exPlanations) approach, the mechanistic interpretations of the models were explored; the results indicated that the most important variables for model decision making were coherent with the expected chemical shifts for each functional group. Several metrics, including Tanimoto, geometric, arithmetic, and Tversky, can be used to perform the similarity calculation for the search algorithm. This algorithm can also incorporate additional variables, such as the correction parameter and the difference between the amount of signals in the query spectrum and the database spectra, while preserving its high performance speed. We hope that our descriptor can link information from spectroscopic/spectrometric techniques with ML models to expand the possibilities in understanding the field of cheminformatics. All databases and algorithms developed for this work are open sources and freely accessible. We have developed an algorithm to generate a new spectra-based descriptor, called SpectraFP, in order to digitalize the chemical shifts of 13 C NMR spectra, as well as potentially important data from other spectroscopic techniques.</abstract><cop>England</cop><pub>Royal Society of Chemistry</pub><pmid>37378661</pmid><doi>10.1039/d3cp00734k</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0002-0399-4915</orcidid><orcidid>https://orcid.org/0000-0003-1706-5751</orcidid><orcidid>https://orcid.org/0000-0002-0664-171X</orcidid><orcidid>https://orcid.org/0000-0001-9049-1533</orcidid><orcidid>https://orcid.org/0000-0003-2627-6067</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 1463-9076
ispartof Physical chemistry chemical physics : PCCP, 2023-07, Vol.25 (27), p.1838-1847
issn 1463-9076
1463-9084
language eng
recordid cdi_crossref_primary_10_1039_D3CP00734K
source Royal Society Of Chemistry Journals; Alma/SFX Local Collection
subjects Algorithms
Chemical equilibrium
Decision making
Functional groups
Goodness of fit
Machine learning
NMR
Nuclear magnetic resonance
Search algorithms
Similarity
Spectra
Spectrometry
Spectroscopy
title SpectraFP: a new spectra-based descriptor to aid in cheminformatics, molecular characterization and search algorithm applications
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T01%3A27%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SpectraFP:%20a%20new%20spectra-based%20descriptor%20to%20aid%20in%20cheminformatics,%20molecular%20characterization%20and%20search%20algorithm%20applications&rft.jtitle=Physical%20chemistry%20chemical%20physics%20:%20PCCP&rft.au=Dias-Silva,%20Jefferson%20R&rft.date=2023-07-12&rft.volume=25&rft.issue=27&rft.spage=1838&rft.epage=1847&rft.pages=1838-1847&rft.issn=1463-9076&rft.eissn=1463-9084&rft_id=info:doi/10.1039/d3cp00734k&rft_dat=%3Cproquest_cross%3E2830667662%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2835820135&rft_id=info:pmid/37378661&rfr_iscdi=true