Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning

The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which che...

Bibliographic details
Published in: ACS ES&T engineering 2022-07, Vol.2 (7), p.1211-1220
Main authors: Zhong, Shifa; Lambeth, Dylan R; Igou, Thomas K; Chen, Yongsheng
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1220
container_issue 7
container_start_page 1211
container_title ACS ES&T engineering
container_volume 2
creator Zhong, Shifa
Lambeth, Dylan R
Igou, Thomas K
Chen, Yongsheng
description The first step in developing a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k•OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than that of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.
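The selection step summarized in the description (repeatedly pick the chemical with the highest GP prediction uncertainty and add it to the training set) might be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names (`rbf`, `posterior_stds`, `select_by_uncertainty`), the squared-exponential kernel, the fixed length scale, and the toy one-dimensional descriptors are all assumptions. Because the kernel hyperparameters are held fixed here, the posterior standard deviation depends only on the descriptors; a real workflow would presumably refit the GP on measured values in each round.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two descriptor vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, b_i in enumerate(b):
        y.append((b_i - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def posterior_stds(train_X, candidates, noise=1e-6):
    """GP posterior standard deviation at each candidate descriptor vector."""
    # Kernel matrix of the training inputs, with a small noise term on the diagonal.
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_X)]
         for i, a in enumerate(train_X)]
    L = cholesky(K)
    stds = []
    for x in candidates:
        # variance = k(x, x) - k_*^T K^{-1} k_*, computed through the Cholesky factor
        v = forward_solve(L, [rbf(a, x) for a in train_X])
        variance = rbf(x, x) - sum(t * t for t in v)
        stds.append(math.sqrt(max(variance, 0.0)))
    return stds


def select_by_uncertainty(pool, initial, n_total):
    """Grow the training set by always adding the most uncertain pool member."""
    selected = list(initial)
    remaining = [i for i in range(len(pool)) if i not in selected]
    while len(selected) < n_total and remaining:
        stds = posterior_stds([pool[i] for i in selected],
                              [pool[i] for i in remaining])
        best = max(range(len(remaining)), key=stds.__getitem__)
        selected.append(remaining.pop(best))
    return selected


# Toy pool: three near-identical descriptors and one isolated descriptor.
pool = [[0.0], [0.1], [0.2], [5.0]]
print(select_by_uncertainty(pool, initial=[0], n_total=3))
```

In this toy pool the isolated descriptor at 5.0 is selected before either near-duplicate of the starting point, which mirrors the claim that uncertainty-based selection yields a more diverse training set than random splitting.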
doi_str_mv 10.1021/acsestengg.1c00434
format Article
fulltext fulltext
identifier ISSN: 2690-0645
ispartof ACS ES&T engineering, 2022-07, Vol.2 (7), p.1211-1220
issn 2690-0645
2690-0645
language eng
recordid cdi_proquest_miscellaneous_2718263114
source ACS Journals
subjects data collection
normal distribution
prediction
quantitative structure-activity relationships
solubility
standard deviation
uncertainty
title Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T17%3A15%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enlarging%20Applicability%20Domain%20of%20Quantitative%20Structure%E2%80%93Activity%20Relationship%20Models%20through%20Uncertainty-Based%20Active%20Learning&rft.jtitle=ACS%20ES&T%20engineering&rft.au=Zhong,%20Shifa&rft.date=2022-07-08&rft.volume=2&rft.issue=7&rft.spage=1211&rft.epage=1220&rft.pages=1211-1220&rft.issn=2690-0645&rft.eissn=2690-0645&rft_id=info:doi/10.1021/acsestengg.1c00434&rft_dat=%3Cproquest_cross%3E2718263114%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718263114&rft_id=info:pmid/&rfr_iscdi=true