How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space

Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Eu...

Bibliographic details
Published in: Journal of Chemical Information and Modeling 2009-01, Vol.49 (1), p.108-119
Main authors: Bender, Andreas; Jenkins, Jeremy L; Scheiber, Josef; Sukuru, Sai Chetan K; Glick, Meir; Davies, John W
Format: Article
Language: English
Online access: Full text
container_end_page 119
container_issue 1
container_start_page 108
container_title Journal of Chemical Information and Modeling
container_volume 49
creator Bender, Andreas
Jenkins, Jeremy L
Scheiber, Josef
Sukuru, Sai Chetan K
Glick, Meir
Davies, John W
description Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-ordering of different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. Four broad descriptor classes are identified, which are circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is much more defined by those four classes than the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for performance of topological autocorrelation vectors, but not circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.
doi_str_mv 10.1021/ci800249s
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_66853810</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>66853810</sourcerecordid><originalsourceid>FETCH-LOGICAL-a340t-182644ac0b2b437858e6a7ea5cb980edfc4842c6fbd26606e6a2a5857b33428e3</originalsourceid><addsrcrecordid>eNplkF1LwzAUhoMobk4v_AMSBAUvqvlqll7JmB8TNhSm1yXNTl1G29SkRfbv7dhU0Kvzwnl4D-dB6JSSa0oYvTFWEcJEEvZQn8aMRDEVbH-TRRIlcSJ76CiEFSGcJ5Idoh5NKOMJE320mrhPPLelLbTHIw_f2TZrPAftzdJW73gGzdItwi0e4RdvK2NrXeCxK2tXQdXgUaWLdbABuxzPXAGm3bTdQTDe1o3zeF5rA8foINdFgJPdHKC3h_vX8SSaPj8-jUfTSHNBmogqJoXQhmQsE3yoYgVSD0HHJksUgUVuhBLMyDxbMCmJ7LZMxyoeZpwLpoAP0OW2t_buo4XQpKUNBopCV-DakEqpYq4o6cDzP-DKtb77JaSMyk4QIaqDrraQ8S4ED3lae1tqv04pSTf20x_7HXu2K2yzEha_5E53B1xsAW3C77H_RV8f3Yrc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>216239008</pqid></control><display><type>article</type><title>How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space</title><source>ACS Publications</source><source>MEDLINE</source><creator>Bender, Andreas ; Jenkins, Jeremy L ; Scheiber, Josef ; Sukuru, Sai Chetan K ; Glick, Meir ; Davies, John W</creator><creatorcontrib>Bender, Andreas ; Jenkins, Jeremy L ; Scheiber, Josef ; Sukuru, Sai Chetan K ; Glick, Meir ; Davies, John W</creatorcontrib><description>Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-ordering of different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. 
Four broad descriptor classes are identified, which are circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is much more defined by those four classes than the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for performance of topological autocorrelation vectors, but not circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single fingerprint results. 
While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1520-5142</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/ci800249s</identifier><identifier>PMID: 19123924</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Analytical chemistry ; Chemical compounds ; Databases, Factual ; Drug Evaluation, Preclinical ; Euclidean space ; Informatics ; Molecular Structure ; Pharmaceutical Modeling ; Principal Component Analysis ; Principal components analysis ; User-Computer Interface</subject><ispartof>Journal of Chemical Information and Modeling, 2009-01, Vol.49 (1), p.108-119</ispartof><rights>Copyright © 2009 American Chemical Society</rights><rights>Copyright American Chemical Society Jan 26, 2009</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a340t-182644ac0b2b437858e6a7ea5cb980edfc4842c6fbd26606e6a2a5857b33428e3</citedby><cites>FETCH-LOGICAL-a340t-182644ac0b2b437858e6a7ea5cb980edfc4842c6fbd26606e6a2a5857b33428e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/ci800249s$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/ci800249s$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,776,780,2752,27053,27901,27902,56713,56763</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/19123924$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Bender, 
Andreas</creatorcontrib><creatorcontrib>Jenkins, Jeremy L</creatorcontrib><creatorcontrib>Scheiber, Josef</creatorcontrib><creatorcontrib>Sukuru, Sai Chetan K</creatorcontrib><creatorcontrib>Glick, Meir</creatorcontrib><creatorcontrib>Davies, John W</creatorcontrib><title>How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space</title><title>Journal of Chemical Information and Modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-ordering of different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. Four broad descriptor classes are identified, which are circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is much more defined by those four classes than the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for performance of topological autocorrelation vectors, but not circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). 
Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.</description><subject>Analytical chemistry</subject><subject>Chemical compounds</subject><subject>Databases, Factual</subject><subject>Drug Evaluation, Preclinical</subject><subject>Euclidean space</subject><subject>Informatics</subject><subject>Molecular Structure</subject><subject>Pharmaceutical Modeling</subject><subject>Principal Component Analysis</subject><subject>Principal components analysis</subject><subject>User-Computer Interface</subject><issn>1549-9596</issn><issn>1520-5142</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2009</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNplkF1LwzAUhoMobk4v_AMSBAUvqvlqll7JmB8TNhSm1yXNTl1G29SkRfbv7dhU0Kvzwnl4D-dB6JSSa0oYvTFWEcJEEvZQn8aMRDEVbH-TRRIlcSJ76CiEFSGcJ5Idoh5NKOMJE320mrhPPLelLbTHIw_f2TZrPAftzdJW73gGzdItwi0e4RdvK2NrXeCxK2tXQdXgUaWLdbABuxzPXAGm3bTdQTDe1o3zeF5rA8foINdFgJPdHKC3h_vX8SSaPj8-jUfTSHNBmogqJoXQhmQsE3yoYgVSD0HHJksUgUVuhBLMyDxbMCmJ7LZMxyoeZpwLpoAP0OW2t_buo4XQpKUNBopCV-DakEqpYq4o6cDzP-DKtb77JaSMyk4QIaqDrraQ8S4ED3lae1tqv04pSTf20x_7HXu2K2yzEha_5E53B1xsAW3C77H_RV8f3Yrc</recordid><startdate>20090101</startdate><enddate>20090101</enddate><creator>Bender, Andreas</creator><creator>Jenkins, Jeremy L</creator><creator>Scheiber, Josef</creator><creator>Sukuru, Sai Chetan K</creator><creator>Glick, Meir</creator><creator>Davies, John W</creator><general>American Chemical 
Society</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope></search><sort><creationdate>20090101</creationdate><title>How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space</title><author>Bender, Andreas ; Jenkins, Jeremy L ; Scheiber, Josef ; Sukuru, Sai Chetan K ; Glick, Meir ; Davies, John W</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a340t-182644ac0b2b437858e6a7ea5cb980edfc4842c6fbd26606e6a2a5857b33428e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2009</creationdate><topic>Analytical chemistry</topic><topic>Chemical compounds</topic><topic>Databases, Factual</topic><topic>Drug Evaluation, Preclinical</topic><topic>Euclidean space</topic><topic>Informatics</topic><topic>Molecular Structure</topic><topic>Pharmaceutical Modeling</topic><topic>Principal Component Analysis</topic><topic>Principal components analysis</topic><topic>User-Computer Interface</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bender, Andreas</creatorcontrib><creatorcontrib>Jenkins, Jeremy L</creatorcontrib><creatorcontrib>Scheiber, Josef</creatorcontrib><creatorcontrib>Sukuru, Sai Chetan K</creatorcontrib><creatorcontrib>Glick, Meir</creatorcontrib><creatorcontrib>Davies, John W</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems 
Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of Chemical Information and Modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bender, Andreas</au><au>Jenkins, Jeremy L</au><au>Scheiber, Josef</au><au>Sukuru, Sai Chetan K</au><au>Glick, Meir</au><au>Davies, John W</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space</atitle><jtitle>Journal of Chemical Information and Modeling</jtitle><addtitle>J. Chem. Inf. Model</addtitle><date>2009-01-01</date><risdate>2009</risdate><volume>49</volume><issue>1</issue><spage>108</spage><epage>119</epage><pages>108-119</pages><issn>1549-9596</issn><eissn>1520-5142</eissn><eissn>1549-960X</eissn><abstract>Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-ordering of different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. 
Four broad descriptor classes are identified, which are circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is much more defined by those four classes than the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for performance of topological autocorrelation vectors, but not circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>19123924</pmid><doi>10.1021/ci800249s</doi><tpages>12</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1549-9596
ispartof Journal of Chemical Information and Modeling, 2009-01, Vol.49 (1), p.108-119
issn 1549-9596
1520-5142
1549-960X
language eng
recordid cdi_proquest_miscellaneous_66853810
source ACS Publications; MEDLINE
subjects Analytical chemistry
Chemical compounds
Databases, Factual
Drug Evaluation, Preclinical
Euclidean space
Informatics
Molecular Structure
Pharmaceutical Modeling
Principal Component Analysis
Principal components analysis
User-Computer Interface
title How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-11T05%3A42%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=How%20Similar%20Are%20Similarity%20Searching%20Methods?%20A%20Principal%20Component%20Analysis%20of%20Molecular%20Descriptor%20Space&rft.jtitle=Journal%20of%20Chemical%20Information%20and%20Modeling&rft.au=Bender,%20Andreas&rft.date=2009-01-01&rft.volume=49&rft.issue=1&rft.spage=108&rft.epage=119&rft.pages=108-119&rft.issn=1549-9596&rft.eissn=1520-5142&rft_id=info:doi/10.1021/ci800249s&rft_dat=%3Cproquest_cross%3E66853810%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=216239008&rft_id=info:pmid/19123924&rfr_iscdi=true
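
The description field above outlines a three-step procedure: rank the database with each descriptor, take Euclidean distances between the resulting rank orderings, and visualize descriptor similarity with PCA. The following is a minimal sketch of that idea only, not the authors' implementation. The similarity scores are random toy data, the four descriptor names are hypothetical labels, and running PCA on the rows of the distance matrix is an assumption, since the record does not say which matrix the PCA was applied to.

```python
import numpy as np


def rank_compounds(similarities):
    """Return each compound's rank position (0 = most similar to the query)."""
    order = np.argsort(-similarities, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(similarities))
    return ranks


def descriptor_distance_matrix(rank_matrix):
    """Pairwise Euclidean distances between rows (one row of ranks per descriptor)."""
    diff = rank_matrix[:, None, :] - rank_matrix[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pca_project(features, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data (assumption): similarity of 50 database compounds to one query,
# scored by four hypothetical descriptors. Two pairs are built to behave alike.
rng = np.random.default_rng(0)
base_a = rng.random(50)
base_b = rng.random(50)
scores = {
    "descriptor_a1": base_a,
    "descriptor_a2": base_a + 0.01 * rng.standard_normal(50),
    "descriptor_b1": base_b,
    "descriptor_b2": base_b + 0.01 * rng.standard_normal(50),
}

# Step 1: one rank ordering of the database per descriptor.
rank_matrix = np.vstack([rank_compounds(s) for s in scores.values()])

# Step 2: Euclidean distance between rank orderings (descriptor similarity).
distances = descriptor_distance_matrix(rank_matrix)

# Step 3: PCA for a 2-D view of descriptor space (here on the distance rows).
coords = pca_project(distances, n_components=2)

for name, (pc1, pc2) in zip(scores, coords):
    print(f"{name}: PC1={pc1:.2f}, PC2={pc2:.2f}")
```

In a real study the rankings would be accumulated over many query compounds rather than the single query used here, and the scores would come from actual fingerprint similarity calculations.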