Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning
The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which che...
Published in: | ACS ES&T engineering 2022-07, Vol.2 (7), p.1211-1220 |
---|---|
Main authors: | Zhong, Shifa; Lambeth, Dylan R; Igou, Thomas K; Chen, Yongsheng |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 1220 |
---|---|
container_issue | 7 |
container_start_page | 1211 |
container_title | ACS ES&T engineering |
container_volume | 2 |
creator | Zhong, Shifa; Lambeth, Dylan R; Igou, Thomas K; Chen, Yongsheng |
description | The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k •OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than those of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time. |
doi_str_mv | 10.1021/acsestengg.1c00434 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2718263114</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2718263114</sourcerecordid><originalsourceid>FETCH-LOGICAL-a319t-ce91b8bc9dcecd028685b8b121cafab1e0b7bd3884389e09a661a61a87d941813</originalsourceid><addsrcrecordid>eNp9UMlOwzAQjRBIVNAf4OQjlxRPtjrHUsoiFSGWniPHmaSuUjvYDlJv8A38IV-CSyvBCWmk2d57o3lBcAZ0BDSCCy4sWoeqaUYgKE3i5CAYRFlOQ5ol6eGf-jgYWruilEZxyoClg-BjplpuGqkaMum6Vgpeyla6DbnSay4V0TV57Lly0nEn35A8O9ML1xv8ev-cCD_aYp-w9Vut7FJ25F5X2Frilkb3zZIslEDjvJTbhJfcYkV-aEjmyI3yd0-Do5q3Fof7fBIsrmcv09tw_nBzN53MQx5D7kKBOZSsFHklUFQ0YhlLfQ8RCF7zEpCW47KKGUtiliPNeZYB98HGVZ4Ag_gkON_pdka_9t6wYi2twLblCnVvi2gMLMpigMRDox1UGG2twbrojFxzsymAFlvLi1_Li73lnjTakfyuWOneKP_Nf4Rv1oaLZg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2718263114</pqid></control><display><type>article</type><title>Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning</title><source>ACS Journals</source><creator>Zhong, Shifa ; Lambeth, Dylan R ; Igou, Thomas K ; Chen, Yongsheng</creator><creatorcontrib>Zhong, Shifa ; Lambeth, Dylan R ; Igou, Thomas K ; Chen, Yongsheng</creatorcontrib><description>The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. 
We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k •OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than those of RS-based models (up to 24 times larger). 
This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.</description><identifier>ISSN: 2690-0645</identifier><identifier>EISSN: 2690-0645</identifier><identifier>DOI: 10.1021/acsestengg.1c00434</identifier><language>eng</language><publisher>American Chemical Society</publisher><subject>data collection ; normal distribution ; prediction ; quantitative structure-activity relationships ; solubility ; standard deviation ; uncertainty</subject><ispartof>ACS ES&T engineering, 2022-07, Vol.2 (7), p.1211-1220</ispartof><rights>2022 American Chemical Society</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a319t-ce91b8bc9dcecd028685b8b121cafab1e0b7bd3884389e09a661a61a87d941813</citedby><cites>FETCH-LOGICAL-a319t-ce91b8bc9dcecd028685b8b121cafab1e0b7bd3884389e09a661a61a87d941813</cites><orcidid>0000-0002-5822-0837 ; 0000-0002-9519-2302</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acsestengg.1c00434$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acsestengg.1c00434$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,777,781,2752,27057,27905,27906,56719,56769</link.rule.ids></links><search><creatorcontrib>Zhong, Shifa</creatorcontrib><creatorcontrib>Lambeth, Dylan R</creatorcontrib><creatorcontrib>Igou, Thomas K</creatorcontrib><creatorcontrib>Chen, Yongsheng</creatorcontrib><title>Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning</title><title>ACS ES&T engineering</title><addtitle>ACS EST Engg</addtitle><description>The first step to develop a 
quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k •OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than those of RS-based models (up to 24 times larger). 
This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.</description><subject>data collection</subject><subject>normal distribution</subject><subject>prediction</subject><subject>quantitative structure-activity relationships</subject><subject>solubility</subject><subject>standard deviation</subject><subject>uncertainty</subject><issn>2690-0645</issn><issn>2690-0645</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNp9UMlOwzAQjRBIVNAf4OQjlxRPtjrHUsoiFSGWniPHmaSuUjvYDlJv8A38IV-CSyvBCWmk2d57o3lBcAZ0BDSCCy4sWoeqaUYgKE3i5CAYRFlOQ5ol6eGf-jgYWruilEZxyoClg-BjplpuGqkaMum6Vgpeyla6DbnSay4V0TV57Lly0nEn35A8O9ML1xv8ev-cCD_aYp-w9Vut7FJ25F5X2Frilkb3zZIslEDjvJTbhJfcYkV-aEjmyI3yd0-Do5q3Fof7fBIsrmcv09tw_nBzN53MQx5D7kKBOZSsFHklUFQ0YhlLfQ8RCF7zEpCW47KKGUtiliPNeZYB98HGVZ4Ag_gkON_pdka_9t6wYi2twLblCnVvi2gMLMpigMRDox1UGG2twbrojFxzsymAFlvLi1_Li73lnjTakfyuWOneKP_Nf4Rv1oaLZg</recordid><startdate>20220708</startdate><enddate>20220708</enddate><creator>Zhong, Shifa</creator><creator>Lambeth, Dylan R</creator><creator>Igou, Thomas K</creator><creator>Chen, Yongsheng</creator><general>American Chemical Society</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7S9</scope><scope>L.6</scope><orcidid>https://orcid.org/0000-0002-5822-0837</orcidid><orcidid>https://orcid.org/0000-0002-9519-2302</orcidid></search><sort><creationdate>20220708</creationdate><title>Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning</title><author>Zhong, Shifa ; Lambeth, Dylan R ; Igou, Thomas K ; Chen, 
Yongsheng</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a319t-ce91b8bc9dcecd028685b8b121cafab1e0b7bd3884389e09a661a61a87d941813</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>data collection</topic><topic>normal distribution</topic><topic>prediction</topic><topic>quantitative structure-activity relationships</topic><topic>solubility</topic><topic>standard deviation</topic><topic>uncertainty</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhong, Shifa</creatorcontrib><creatorcontrib>Lambeth, Dylan R</creatorcontrib><creatorcontrib>Igou, Thomas K</creatorcontrib><creatorcontrib>Chen, Yongsheng</creatorcontrib><collection>CrossRef</collection><collection>AGRICOLA</collection><collection>AGRICOLA - Academic</collection><jtitle>ACS ES&T engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhong, Shifa</au><au>Lambeth, Dylan R</au><au>Igou, Thomas K</au><au>Chen, Yongsheng</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning</atitle><jtitle>ACS ES&T engineering</jtitle><addtitle>ACS EST Engg</addtitle><date>2022-07-08</date><risdate>2022</risdate><volume>2</volume><issue>7</issue><spage>1211</spage><epage>1220</epage><pages>1211-1220</pages><issn>2690-0645</issn><eissn>2690-0645</eissn><abstract>The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. 
A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k •OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than those of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.</abstract><pub>American Chemical Society</pub><doi>10.1021/acsestengg.1c00434</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0002-5822-0837</orcidid><orcidid>https://orcid.org/0000-0002-9519-2302</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2690-0645 |
ispartof | ACS ES&T engineering, 2022-07, Vol.2 (7), p.1211-1220 |
issn | 2690-0645 2690-0645 |
language | eng |
recordid | cdi_proquest_miscellaneous_2718263114 |
source | ACS Journals |
subjects | data collection; normal distribution; prediction; quantitative structure-activity relationships; solubility; standard deviation; uncertainty |
title | Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T17%3A15%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enlarging%20Applicability%20Domain%20of%20Quantitative%20Structure%E2%80%93Activity%20Relationship%20Models%20through%20Uncertainty-Based%20Active%20Learning&rft.jtitle=ACS%20ES&T%20engineering&rft.au=Zhong,%20Shifa&rft.date=2022-07-08&rft.volume=2&rft.issue=7&rft.spage=1211&rft.epage=1220&rft.pages=1211-1220&rft.issn=2690-0645&rft.eissn=2690-0645&rft_id=info:doi/10.1021/acsestengg.1c00434&rft_dat=%3Cproquest_cross%3E2718263114%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718263114&rft_id=info:pmid/&rfr_iscdi=true |
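
The description field above says that uncertainty-based active learning "iteratively identifies chemicals with the highest uncertainty and selects them for model training", with the Gaussian process standard deviation as the uncertainty measure. The loop can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the RBF kernel, the fixed length scale and noise level, the toy one-dimensional descriptors, and all function names are assumptions made for illustration. With fixed kernel hyperparameters the GP posterior standard deviation depends only on the descriptors, so the sketch needs no measured property values; a real workflow would presumably refit the model with each newly measured chemical.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two descriptor vectors (assumed kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def posterior_std(X_train, x, noise=1e-6):
    """GP posterior standard deviation at x given the training descriptors."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X_train)]
         for i, a in enumerate(X_train)]
    k_star = [rbf(a, x) for a in X_train]
    v = forward_solve(cholesky(K), k_star)
    var = rbf(x, x) - sum(t * t for t in v)
    return math.sqrt(max(var, 0.0))

def select_by_uncertainty(pool, seed_idx, n_select):
    """Greedily add the pool chemical with the highest predictive std."""
    chosen = list(seed_idx)
    while len(chosen) < n_select:
        X_train = [pool[i] for i in chosen]
        candidates = [i for i in range(len(pool)) if i not in chosen]
        best = max(candidates, key=lambda i: posterior_std(X_train, pool[i]))
        chosen.append(best)
    return chosen

# Toy pool: four near-duplicate chemicals and two structurally distant ones.
pool = [[0.0], [0.1], [0.2], [0.3], [5.0], [10.0]]
print(select_by_uncertainty(pool, seed_idx=[0], n_select=3))  # -> [0, 5, 4]
```

Starting from the chemical at index 0, the loop skips the near-duplicates and picks the two distant chemicals first, which is consistent with the description's claim that the chemicals selected this way are more diverse than a random split.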