MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning
High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hund...
Published in: | Journal of chemical information and modeling 2023-05, Vol.63 (9), p.2667-2678 |
---|---|
Main authors: | Buterez, David ; Janet, Jon Paul ; Kiddle, Steven J. ; Liò, Pietro |
Format: | Article |
Language: | eng |
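Subjects: | Benchmarking ; Biological Assay ; Data acquisition ; Datasets ; Deep learning ; Design of experiments ; Drug Discovery - methods ; High-Throughput Screening Assays - methods ; Machine Learning ; Machine Learning and Deep Learning ; Screening ; Source code |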
Online access: | Full text |
container_end_page | 2678 |
---|---|
container_issue | 9 |
container_start_page | 2667 |
container_title | Journal of chemical information and modeling |
container_volume | 63 |
creator | Buterez, David ; Janet, Jon Paul ; Kiddle, Steven J. ; Liò, Pietro |
description | High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, and can potentially lead to improved drug activity predictions and cheaper and more effective experimental design. However, existing collections of machine-learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of “noisy” activity values from primary screening, are effectively ignored in the majority of machine learning models of HTS data. To address these limitations, we introduce Multifidelity PubChem BioAssay (MF-PCBA), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multifidelity. Multifidelity data accurately reflect real-world HTS conventions and present a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here we detail the steps taken to assemble MF-PCBA in terms of data acquisition from PubChem and the filtering steps required to curate the raw data. 
We also provide an evaluation of a recent deep-learning-based method for multifidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and a discussion in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule–protein interactions. The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba. |
doi_str_mv | 10.1021/acs.jcim.2c01569 |
format | Article |
fullrecord | <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10170507</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2801978909</sourcerecordid><originalsourceid>FETCH-LOGICAL-a462t-528e0037d3cc716dc2f95cde3809f71dc5e7cae332b544dc66088bc6f40be88e3</originalsourceid><addsrcrecordid>eNp1kc1vEzEQxS0EoqXlzglZ4sKBDfZ6P2wuVZu2FClRK1EkbpYzO7vrsPEGe7dS_vs6JKkAqSdb8u89v5lHyDvOJpyl_LOBMFmCXU1SYDwv1AtyzPNMJapgP18e7rkqjsibEJaMCaGK9DU5EiXLZS7lMYH5dXI3vTj_QudjN9jaVtjZYUNvbNMm963vx6ZdjwP9Dh7RWdfQC3TQroz_FWjde3rpx4Ze2gD9A_oNNa6icwOtdUhnaPxWckpe1aYL-HZ_npAf11f305tkdvv12_R8lpisSIckTyXGiGUlAEpeVJDWKocKhWSqLnkFOZZgUIh0kWdZBUXBpFxAUWdsgVKiOCFnO9_1uFhhBegGbzq99jbG3ejeWP3vi7OtbvoHzRmPC2FldPi4d_D97xHDoFdxMuw647Afg04l46qUiqmIfvgPXfajd3G-SHGelhkTPFJsR4HvQ_BYP6XhTG8r1LFCva1Q7yuMkvd_T_EkOHQWgU874I_08Omzfo_l86i9</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2811274031</pqid></control><display><type>article</type><title>MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning</title><source>MEDLINE</source><source>American Chemical Society Publications</source><creator>Buterez, David ; Janet, Jon Paul ; Kiddle, Steven J. ; Liò, Pietro</creator><creatorcontrib>Buterez, David ; Janet, Jon Paul ; Kiddle, Steven J. ; Liò, Pietro</creatorcontrib><description>High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. 
Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, and can potentially lead to improved drug activity predictions and cheaper and more effective experimental design. However, existing collections of machine-learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of “noisy” activity values from primary screening, are effectively ignored in the majority of machine learning models of HTS data. To address these limitations, we introduce Multifidelity PubChem BioAssay (MF-PCBA), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multifidelity. Multifidelity data accurately reflect real-world HTS conventions and present a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here we detail the steps taken to assemble MF-PCBA in terms of data acquisition from PubChem and the filtering steps required to curate the raw data. We also provide an evaluation of a recent deep-learning-based method for multifidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and a discussion in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule–protein interactions. 
The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.2c01569</identifier><identifier>PMID: 37058588</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Benchmarking ; Biological Assay ; Data acquisition ; Datasets ; Deep learning ; Design of experiments ; Drug Discovery - methods ; High-Throughput Screening Assays - methods ; Machine Learning ; Machine Learning and Deep Learning ; Screening ; Source code</subject><ispartof>Journal of chemical information and modeling, 2023-05, Vol.63 (9), p.2667-2678</ispartof><rights>2023 The Authors. Published by American Chemical Society</rights><rights>Copyright American Chemical Society May 8, 2023</rights><rights>2023 The Authors. Published by American Chemical Society 2023 The Authors</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a462t-528e0037d3cc716dc2f95cde3809f71dc5e7cae332b544dc66088bc6f40be88e3</citedby><cites>FETCH-LOGICAL-a462t-528e0037d3cc716dc2f95cde3809f71dc5e7cae332b544dc66088bc6f40be88e3</cites><orcidid>0000-0001-6558-0833 ; 0000-0001-7825-4797</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acs.jcim.2c01569$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acs.jcim.2c01569$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>230,314,780,784,885,2765,27076,27924,27925,56738,56788</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37058588$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Buterez, 
David</creatorcontrib><creatorcontrib>Janet, Jon Paul</creatorcontrib><creatorcontrib>Kiddle, Steven J.</creatorcontrib><creatorcontrib>Liò, Pietro</creatorcontrib><title>MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, and can potentially lead to improved drug activity predictions and cheaper and more effective experimental design. However, existing collections of machine-learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of “noisy” activity values from primary screening, are effectively ignored in the majority of machine learning models of HTS data. To address these limitations, we introduce Multifidelity PubChem BioAssay (MF-PCBA), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multifidelity. 
Multifidelity data accurately reflect real-world HTS conventions and present a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here we detail the steps taken to assemble MF-PCBA in terms of data acquisition from PubChem and the filtering steps required to curate the raw data. We also provide an evaluation of a recent deep-learning-based method for multifidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and a discussion in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule–protein interactions. The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba.</description><subject>Benchmarking</subject><subject>Biological Assay</subject><subject>Data acquisition</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Design of experiments</subject><subject>Drug Discovery - methods</subject><subject>High-Throughput Screening Assays - methods</subject><subject>Machine Learning</subject><subject>Machine Learning and Deep Learning</subject><subject>Screening</subject><subject>Source 
code</subject><issn>1549-9596</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNp1kc1vEzEQxS0EoqXlzglZ4sKBDfZ6P2wuVZu2FClRK1EkbpYzO7vrsPEGe7dS_vs6JKkAqSdb8u89v5lHyDvOJpyl_LOBMFmCXU1SYDwv1AtyzPNMJapgP18e7rkqjsibEJaMCaGK9DU5EiXLZS7lMYH5dXI3vTj_QudjN9jaVtjZYUNvbNMm963vx6ZdjwP9Dh7RWdfQC3TQroz_FWjde3rpx4Ze2gD9A_oNNa6icwOtdUhnaPxWckpe1aYL-HZ_npAf11f305tkdvv12_R8lpisSIckTyXGiGUlAEpeVJDWKocKhWSqLnkFOZZgUIh0kWdZBUXBpFxAUWdsgVKiOCFnO9_1uFhhBegGbzq99jbG3ejeWP3vi7OtbvoHzRmPC2FldPi4d_D97xHDoFdxMuw647Afg04l46qUiqmIfvgPXfajd3G-SHGelhkTPFJsR4HvQ_BYP6XhTG8r1LFCva1Q7yuMkvd_T_EkOHQWgU874I_08Omzfo_l86i9</recordid><startdate>20230508</startdate><enddate>20230508</enddate><creator>Buterez, David</creator><creator>Janet, Jon Paul</creator><creator>Kiddle, Steven J.</creator><creator>Liò, Pietro</creator><general>American Chemical Society</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-6558-0833</orcidid><orcidid>https://orcid.org/0000-0001-7825-4797</orcidid></search><sort><creationdate>20230508</creationdate><title>MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning</title><author>Buterez, David ; Janet, Jon Paul ; Kiddle, Steven J. 
; Liò, Pietro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a462t-528e0037d3cc716dc2f95cde3809f71dc5e7cae332b544dc66088bc6f40be88e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Benchmarking</topic><topic>Biological Assay</topic><topic>Data acquisition</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Design of experiments</topic><topic>Drug Discovery - methods</topic><topic>High-Throughput Screening Assays - methods</topic><topic>Machine Learning</topic><topic>Machine Learning and Deep Learning</topic><topic>Screening</topic><topic>Source code</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Buterez, David</creatorcontrib><creatorcontrib>Janet, Jon Paul</creatorcontrib><creatorcontrib>Kiddle, Steven J.</creatorcontrib><creatorcontrib>Liò, Pietro</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of chemical information and 
modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Buterez, David</au><au>Janet, Jon Paul</au><au>Kiddle, Steven J.</au><au>Liò, Pietro</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. Model</addtitle><date>2023-05-08</date><risdate>2023</risdate><volume>63</volume><issue>9</issue><spage>2667</spage><epage>2678</epage><pages>2667-2678</pages><issn>1549-9596</issn><eissn>1549-960X</eissn><abstract>High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, and can potentially lead to improved drug activity predictions and cheaper and more effective experimental design. However, existing collections of machine-learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of “noisy” activity values from primary screening, are effectively ignored in the majority of machine learning models of HTS data. 
To address these limitations, we introduce Multifidelity PubChem BioAssay (MF-PCBA), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multifidelity. Multifidelity data accurately reflect real-world HTS conventions and present a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here we detail the steps taken to assemble MF-PCBA in terms of data acquisition from PubChem and the filtering steps required to curate the raw data. We also provide an evaluation of a recent deep-learning-based method for multifidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and a discussion in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule–protein interactions. The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>37058588</pmid><doi>10.1021/acs.jcim.2c01569</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0001-6558-0833</orcidid><orcidid>https://orcid.org/0000-0001-7825-4797</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1549-9596 |
ispartof | Journal of chemical information and modeling, 2023-05, Vol.63 (9), p.2667-2678 |
issn | 1549-9596 ; 1549-960X |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10170507 |
source | MEDLINE; American Chemical Society Publications |
subjects | Benchmarking ; Biological Assay ; Data acquisition ; Datasets ; Deep learning ; Design of experiments ; Drug Discovery - methods ; High-Throughput Screening Assays - methods ; Machine Learning ; Machine Learning and Deep Learning ; Screening ; Source code |
title | MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T11%3A16%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MF-PCBA:%20Multifidelity%20High-Throughput%20Screening%20Benchmarks%20for%20Drug%20Discovery%20and%20Machine%20Learning&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Buterez,%20David&rft.date=2023-05-08&rft.volume=63&rft.issue=9&rft.spage=2667&rft.epage=2678&rft.pages=2667-2678&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.2c01569&rft_dat=%3Cproquest_pubme%3E2801978909%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2811274031&rft_id=info:pmid/37058588&rfr_iscdi=true |