Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing

Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecu...

Detailed description

Bibliographic details
Published in:	Journal of chemical information and modeling 2006-01, Vol.46 (1), p.321-333
Main authors:	Dutta, Debojyoti ; Guha, Rajarshi ; Jurs, Peter C ; Chen, Ting
Format:	Article
Language:	eng
Subjects:	Biodiversity ; Chemicals ; Data mining ; Genetic algorithms ; Geometry ; Nonprescription drugs
Online access:	Full text

container_end_page	333
container_issue	1
container_start_page	321
container_title	Journal of chemical information and modeling
container_volume	46
creator	Dutta, Debojyoti ; Guha, Rajarshi ; Jurs, Peter C ; Chen, Ting
description	Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
doi_str_mv	10.1021/ci050403o
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_70700830</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>70700830</sourcerecordid><originalsourceid>FETCH-LOGICAL-a378t-eab5661b37ae13e41abc9ed2a3efe7bd31be66059bbd0d98a5ba39cbacbab9183</originalsourceid><addsrcrecordid>eNpl0NFKKzEQBuAginrUC19AFsED52I12WyyzaUUq4JoaRW8C5PsVKO7m5psQd_-pLQqKAQSZj4mw0_IIaOnjBbszDoqaEm53yC7TJQqV5I-bn6-hZI75E-ML5RyrmSxTXaYLAtJZbVLxlMLDZgGszGE3vXOd657yqCrs4v3eeMDLEuZn2XDZ2xdwtl0DhZj9hCX8BJ9i31wNruC-Jwq-2RrBk3Eg_W9Rx5GF_fDq_zm7vJ6eH6TA68GfY5ghJTM8AqQcSwZGKuwLoDjDCtTc2ZQSiqUMTWt1QCEAa6sgXSMYgO-R_6u5s6Df1tg7HXrosWmgQ79IuqKVpQOOE3w-Ad88YvQpd10wWRRlkKUCf1bIRt8jAFneh5cC-FDM6qXGeuvjJM9Wg9cmBbrb7kONYF8BVzs8f2rD-FVp24l9P14qiUfTcrR5FZPkj9ZebDxe7nfH_8H08WSoA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>216244554</pqid></control><display><type>article</type><title>Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing</title><source>ACS Publications</source><creator>Dutta, Debojyoti ; Guha, Rajarshi ; Jurs, Peter C ; Chen, Ting</creator><creatorcontrib>Dutta, Debojyoti ; Guha, Rajarshi ; Jurs, Peter C ; Chen, Ting</creatorcontrib><description>Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. 
In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. 
We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/ci050403o</identifier><identifier>PMID: 16426067</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Biodiversity ; Chemicals ; Data mining ; Genetic algorithms ; Geometry ; Nonprescription drugs</subject><ispartof>Journal of chemical information and modeling, 2006-01, Vol.46 (1), p.321-333</ispartof><rights>Copyright © 2006 American Chemical Society</rights><rights>Copyright American Chemical Society Jan 2006</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a378t-eab5661b37ae13e41abc9ed2a3efe7bd31be66059bbd0d98a5ba39cbacbab9183</citedby><cites>FETCH-LOGICAL-a378t-eab5661b37ae13e41abc9ed2a3efe7bd31be66059bbd0d98a5ba39cbacbab9183</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/ci050403o$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/ci050403o$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,776,780,2752,27053,27901,27902,56713,56763</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/16426067$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Dutta, Debojyoti</creatorcontrib><creatorcontrib>Guha, Rajarshi</creatorcontrib><creatorcontrib>Jurs, Peter C</creatorcontrib><creatorcontrib>Chen, Ting</creatorcontrib><title>Scalable Partitioning and 
Exploration of Chemical Spaces Using Geometric Hashing</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. 
We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.</description><subject>Biodiversity</subject><subject>Chemicals</subject><subject>Data mining</subject><subject>Genetic algorithms</subject><subject>Geometry</subject><subject>Nonprescription drugs</subject><issn>1549-9596</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2006</creationdate><recordtype>article</recordtype><recordid>eNpl0NFKKzEQBuAginrUC19AFsED52I12WyyzaUUq4JoaRW8C5PsVKO7m5psQd_-pLQqKAQSZj4mw0_IIaOnjBbszDoqaEm53yC7TJQqV5I-bn6-hZI75E-ML5RyrmSxTXaYLAtJZbVLxlMLDZgGszGE3vXOd657yqCrs4v3eeMDLEuZn2XDZ2xdwtl0DhZj9hCX8BJ9i31wNruC-Jwq-2RrBk3Eg_W9Rx5GF_fDq_zm7vJ6eH6TA68GfY5ghJTM8AqQcSwZGKuwLoDjDCtTc2ZQSiqUMTWt1QCEAa6sgXSMYgO-R_6u5s6Df1tg7HXrosWmgQ79IuqKVpQOOE3w-Ad88YvQpd10wWRRlkKUCf1bIRt8jAFneh5cC-FDM6qXGeuvjJM9Wg9cmBbrb7kONYF8BVzs8f2rD-FVp24l9P14qiUfTcrR5FZPkj9ZebDxe7nfH_8H08WSoA</recordid><startdate>20060101</startdate><enddate>20060101</enddate><creator>Dutta, Debojyoti</creator><creator>Guha, Rajarshi</creator><creator>Jurs, Peter C</creator><creator>Chen, Ting</creator><general>American Chemical Society</general><scope>BSCLL</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope></search><sort><creationdate>20060101</creationdate><title>Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing</title><author>Dutta, Debojyoti ; Guha, Rajarshi ; Jurs, Peter C ; Chen, 
Ting</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a378t-eab5661b37ae13e41abc9ed2a3efe7bd31be66059bbd0d98a5ba39cbacbab9183</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2006</creationdate><topic>Biodiversity</topic><topic>Chemicals</topic><topic>Data mining</topic><topic>Genetic algorithms</topic><topic>Geometry</topic><topic>Nonprescription drugs</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Dutta, Debojyoti</creatorcontrib><creatorcontrib>Guha, Rajarshi</creatorcontrib><creatorcontrib>Jurs, Peter C</creatorcontrib><creatorcontrib>Chen, Ting</creatorcontrib><collection>Istex</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Dutta, Debojyoti</au><au>Guha, Rajarshi</au><au>Jurs, Peter C</au><au>Chen, Ting</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. 
Model</addtitle><date>2006-01-01</date><risdate>2006</risdate><volume>46</volume><issue>1</issue><spage>321</spage><epage>333</epage><pages>321-333</pages><issn>1549-9596</issn><eissn>1549-960X</eissn><abstract>Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. 
We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>16426067</pmid><doi>10.1021/ci050403o</doi><tpages>13</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 1549-9596
ispartof	Journal of chemical information and modeling, 2006-01, Vol.46 (1), p.321-333
issn	1549-9596 ; 1549-960X
language	eng
recordid	cdi_proquest_miscellaneous_70700830
source	ACS Publications
subjects	Biodiversity ; Chemicals ; Data mining ; Genetic algorithms ; Geometry ; Nonprescription drugs
title	Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T18%3A38%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20Partitioning%20and%20Exploration%20of%20Chemical%20Spaces%20Using%20Geometric%20Hashing&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Dutta,%20Debojyoti&rft.date=2006-01-01&rft.volume=46&rft.issue=1&rft.spage=321&rft.epage=333&rft.pages=321-333&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/ci050403o&rft_dat=%3Cproquest_cross%3E70700830%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=216244554&rft_id=info:pmid/16426067&rfr_iscdi=true
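
Illustrative sketch (not part of the catalog record): the abstract in the description field says that LSH hashes chemical descriptors so that nearby points tend to share buckets, which allows approximate nearest-neighbor search and a quick check for sparse regions. The record does not state which hash family or parameters the authors used, so the following is only a minimal sketch under assumptions: it uses random-projection hashes of the form floor((a·v + b) / w) for Euclidean distance, and the class name EuclideanLSH, its parameters, and the is_sparse rule are all hypothetical.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index (assumed hash family: h(v) = floor((a.v + b) / w))."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4,
                 bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # For each table, a list of (a, b) pairs: a is a Gaussian random
        # direction, b is a random offset in [0, w).
        self.projections = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)],
              rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_idx, v):
        # Concatenate several quantized projections into one bucket key.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.projections[table_idx]
        )

    def add(self, v):
        """Store a descriptor vector and return its integer id."""
        idx = len(self.points)
        self.points.append(list(v))
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)
        return idx

    def candidates(self, v):
        """Ids of stored points sharing a bucket with v in any table."""
        found = set()
        for t, table in enumerate(self.tables):
            found.update(table.get(self._key(t, v), ()))
        return found

    def query(self, v, k=1):
        """Approximate k nearest neighbors: rank only the candidates."""
        cands = self.candidates(v)
        return sorted(cands, key=lambda i: math.dist(self.points[i], v))[:k]

    def is_sparse(self, v, min_neighbors=1):
        """Hypothetical outlier rule: too few colliding points."""
        return len(self.candidates(v)) < min_neighbors


if __name__ == "__main__":
    rng = random.Random(1)
    index = EuclideanLSH(dim=5)
    # Synthetic stand-ins for descriptor vectors, not real molecules.
    for _ in range(200):
        index.add([rng.gauss(0.0, 1.0) for _ in range(5)])
    probe = index.points[10]
    print(index.query(probe, k=3))
    print(index.is_sparse([1000.0] * 5))
```

Only points that collide with the query in at least one table are ranked by exact distance, which is where the speed-up over a full scan would come from. The number of tables and the bucket width trade accuracy against candidate-set size, and the values above are arbitrary.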